JP2015018079A

JP2015018079A - Subtitle voice generation apparatus

Info

Publication number: JP2015018079A
Application number: JP2013144500A
Authority: JP
Inventors: 順長尾; Jun Nagao
Original assignee: Funai Electric Co Ltd
Current assignee: Funai Electric Co Ltd
Priority date: 2013-07-10
Filing date: 2013-07-10
Publication date: 2015-01-29

Abstract

PROBLEM TO BE SOLVED: To provide a subtitle voice generation apparatus that can generate subtitle voice that reduces the sense of discomfort given to a user. SOLUTION: A subtitle voice generation apparatus includes a voice analysis part that analyzes the state of a person's manner of speaking on the basis of input voice data, and a synthetic voice generation part that generates subtitle voice, which is synthetic voice, on the basis of subtitle data corresponding to the voice data and the analysis result from the voice analysis part.

Description

The present invention relates to an apparatus that generates subtitle audio.

Conventionally, subtitle audio generation apparatuses are known that generate subtitle audio by applying speech synthesis to the subtitle character string included in a broadcast signal or the like.

For example, Patent Document 1 discloses a subtitle extraction device that extracts a subtitle portion from a video signal, performs character recognition on the character string contained in the extracted subtitle portion, applies speech synthesis to the recognized subtitle character string, and outputs the subtitle audio from a speaker.

This subtitle extraction device analyzes voice quality characteristics from the input audio signal, selects the closest voice quality from a voice database, and performs speech synthesis with it. For example, if the audio output while Japanese subtitles are displayed in a foreign film is the voice of an actress, speech synthesis is performed with a female voice quality based on the characteristics of that voice.

Patent Document 1 states that this gives at least some individuality to synthesized speech, which tends to sound lifeless.

Patent Document 1: JP 2003-333445 A (page 5)

However, people in films and the like normally vary their speaking speed, the strength of their voice, and so on as they speak. Patent Document 1 does not take this into account during speech synthesis, so the subtitle audio feels unnatural to the viewer listening to it.

In view of the above problem, an object of the present invention is to provide a subtitle audio generation apparatus capable of generating subtitle audio that reduces the sense of discomfort given to the user.

To achieve the above object, the subtitle audio generation apparatus of the present invention includes a voice analysis unit that analyzes the state of a person's manner of speaking based on input audio data, and a synthesized speech generation unit that generates subtitle audio, which is synthesized speech, based on subtitle data corresponding to the audio data and on the analysis result from the voice analysis unit.

With this configuration, the state of the person's manner of speaking can be reflected in the subtitle audio, so subtitle audio that reduces the sense of discomfort given to the user can be generated.

In the above configuration, the state of the person's manner of speaking may be the speed of the voice and/or the strength of the voice.

With this configuration, the speed and/or strength of the voice can be reflected in the subtitle audio. In particular, reflecting the speed of the voice in the subtitle audio reduces the mismatch between the movement of the person's mouth in the video and the subtitle audio, which reduces the sense of discomfort for a user watching the video.

In any of the above configurations, the voice analysis unit may detect a plurality of persons based on the audio data, and the synthesized speech generation unit may generate the subtitle audio by assigning, to each of the detected persons, a synthesized speech pattern from a plurality of synthesized speech patterns stored and prepared in advance in a storage unit.

With this configuration, when a plurality of persons appear, subtitle audio that sounds as if a plurality of persons are speaking can be generated, further reducing the sense of discomfort for the user. In addition, because synthesized speech patterns prepared in advance are assigned, the subtitle audio can be generated quickly.

In the above configuration, the synthesized speech generation unit may store and prepare the plurality of synthesized speech patterns in the storage unit in advance based on program information.

With this configuration, only as many synthesized speech patterns as the program requires can be prepared, which avoids preparing more than necessary.

The plurality of synthesized speech patterns may be synthesized speech patterns of differing voice quality, with a plurality prepared for each gender.

With this configuration, when a plurality of persons of different genders appear, subtitle audio that sounds as if persons of different genders are speaking can be generated.

In any of the above configurations, the synthesized speech generation unit may generate monotone subtitle audio based on the subtitle data when it detects that there is no audio or that no person appears in the video.

With this configuration, narration-like subtitle audio can be generated when there is no audio or when no person appears in the video.
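The monotone fallback can be pictured with a minimal Python sketch. The function name `choose_style`, the flag names, and the dictionary layout are all hypothetical; the document does not say how the absence of audio or of a person is detected, so both are passed in as ready-made booleans.

```python
from typing import Optional


def choose_style(has_audio: bool, person_on_screen: bool,
                 analysis: Optional[dict] = None) -> dict:
    """Pick synthesis settings, falling back to monotone narration.

    has_audio / person_on_screen are assumed to come from detectors
    that are outside the scope of this sketch.
    """
    if not has_audio or not person_on_screen or analysis is None:
        # No voice to imitate: read the subtitle text in a flat voice.
        return {"style": "monotone"}
    # Otherwise carry the analysis result (e.g. speed, strength) through.
    return {"style": "expressive", **analysis}
```

When either condition fails, the analysis result is ignored entirely, which matches the narration-like behavior described above.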

In any of the above configurations, audio based on the audio data may be emitted from a built-in speaker while audio based on the subtitle audio is output from an external output terminal.

With this configuration, for example, a user who wants to listen in the foreign language can listen to the audio from the built-in speaker, while a user who wants to listen in their own language can listen to the subtitle audio output from the external output terminal.

In any of the above configurations, audio based on the audio data may be emitted from a built-in speaker while the subtitle audio is transmitted to an external mobile device using a wireless signal.

With this configuration, for example, a user who wants to listen in the foreign language can listen to the audio from the built-in speaker, while a user who wants to listen in their own language can listen to the subtitle audio output from the mobile device at hand.

According to the present invention, subtitle audio that reduces the sense of discomfort given to the user can be generated.

FIG. 1 is a block diagram showing the schematic configuration of a television apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart of the voice analysis process according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of a plurality of synthesized speech patterns according to an embodiment of the present invention.
FIG. 4 is a block diagram showing the schematic configuration of a television apparatus according to the second embodiment of the present invention.
FIG. 5 is a block diagram showing the schematic configuration of a television apparatus according to the third embodiment of the present invention.
FIG. 6 is a block diagram showing the schematic configuration of a television apparatus according to the fourth embodiment of the present invention.

＜First Embodiment＞
An embodiment of the present invention is described below with reference to the drawings. A television apparatus is used as an example of the subtitle audio generation apparatus. FIG. 1 is a block diagram showing the schematic configuration of the television apparatus according to the first embodiment of the present invention. The television apparatus 1 shown in FIG. 1 includes a tuner 11, a demodulation unit 12, a separation unit 13, a video decoder 14, a data decoder 15, an audio decoder 16, a video output unit 17, a display unit 18, an OSD (on-screen display) unit 19, a voice analysis unit 20, a synthesized speech generation unit 21, an audio output unit 22, and a speaker 23. An antenna 2 is connected to the tuner 11.

The tuner 11 supports, for example, at least one of terrestrial digital broadcasting, BS digital broadcasting, and CS digital broadcasting, and selects the broadcast signal of a desired channel from the high-frequency broadcast signals input from the antenna 2.

The demodulation unit 12 performs processing such as digital demodulation and error correction on the broadcast signal of the channel selected by the tuner 11, generates a transport stream, and outputs it to the separation unit 13.

The separation unit (demultiplexer) 13 separates the transport stream input from the demodulation unit 12 into a video stream, an audio stream, subtitle data, and so on.

The video decoder 14 decodes the video stream input from the separation unit 13 and outputs the resulting video data to the video output unit 17.

The data decoder 15 decodes the subtitle data input from the separation unit 13 and outputs the resulting subtitle text data to the OSD unit 19.

The OSD unit 19 generates display data for on-screen display, such as menus, and outputs it to the video output unit 17. The OSD unit 19 can also generate subtitle display data based on the subtitle text data input from the data decoder 15, and outputs that subtitle display data to the video output unit 17.

The video output unit 17 superimposes the display data input from the OSD unit 19 on the video data input from the video decoder 14, converts the superimposed video data into a video signal suited to the display unit 18, and outputs it to the display unit 18. In some cases, no superimposition is performed and only one of the video data input from the video decoder 14 or the display data input from the OSD unit 19 is converted into a video signal and output to the display unit 18.

The display unit 18 is, for example, a liquid crystal display, and displays video based on the video signal input from the video output unit 17. In this way, the video of a broadcast program including subtitles, menu screens, and other video are displayed on the display unit 18.

The audio decoder 16 decodes the audio stream input from the separation unit 13 and outputs the resulting audio data to the audio output unit 22. The audio output unit 22 converts the audio data input from the audio decoder 16 into an audio signal suited to the speaker 23 and outputs it to the speaker 23. The speaker 23 emits sound based on the audio signal input from the audio output unit 22. In this way, the audio of the broadcast program is emitted from the speaker 23.

The audio decoder 16 can also output the generated audio data to the voice analysis unit 20. The voice analysis unit 20 analyzes the audio data input from the audio decoder 16 and notifies the synthesized speech generation unit 21 of the analysis result.

The synthesized speech generation unit 21 generates subtitle audio data, which is synthesized speech, based on the subtitle text data input from the data decoder 15 and the analysis result notified by the voice analysis unit 20, and outputs it to the audio output unit 22. In this case, the audio output unit 22 converts the subtitle audio data input from the synthesized speech generation unit 21 into an audio signal suited to the speaker 23 and outputs it to the speaker 23. In this way, the subtitle audio is emitted from the speaker 23.

Next, the subtitle audio output operation of the television apparatus 1 according to this embodiment is described in more detail.

For example, suppose that the audio of a broadcast program is in English only, and subtitles are available in Japanese and English. In this case, in the normal mode, the video output unit 17 superimposes, on the video data output from the video decoder 14, the subtitle display data that the OSD unit 19 outputs based on the Japanese subtitle text data output from the data decoder 15 (the language of the subtitle text data is selectable), and the broadcast video including Japanese subtitles is displayed on the display unit 18 (subtitles can also be turned off). In the normal mode, the audio decoder 16 outputs the English audio data to the audio output unit 22, and the broadcast audio in English is emitted from the speaker 23.

A viewer who wants to watch the broadcast with the English audio can use the normal mode described above. A viewer who wants to listen in Japanese switches to the subtitle audio output mode described below. The switch to the subtitle audio output mode is made, for example, by a control unit (not shown) of the television apparatus 1 in response to an operation on a remote control device (not shown). In the subtitle audio output mode, the audio decoder 16 does not output audio data to the audio output unit 22 and outputs it only to the voice analysis unit 20. As a result, the audio carried by the broadcast signal (the English audio in the example above) is not output from the speaker 23. Video display in the subtitle audio output mode is the same as in the normal mode.
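The difference between the two modes comes down to where the audio decoder 16 sends its output. A minimal Python sketch of that routing follows; the enum, the function names, and the string identifiers are hypothetical labels chosen for illustration, and the sketch assumes the voice analysis unit 20 receives no audio in the normal mode, which the description does not state explicitly.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SUBTITLE_AUDIO = "subtitle_audio"


def route_decoded_audio(mode: Mode) -> list:
    """Return the units that receive decoded broadcast audio."""
    if mode is Mode.NORMAL:
        # Broadcast audio goes straight to the audio output unit 22.
        return ["audio_output_unit_22"]
    # Subtitle audio output mode: audio is only analyzed, never played.
    return ["voice_analysis_unit_20"]


def speaker_source(mode: Mode) -> str:
    """Return what the speaker 23 ends up playing in each mode."""
    return "broadcast_audio" if mode is Mode.NORMAL else "subtitle_audio"
```

For example, `speaker_source(Mode.SUBTITLE_AUDIO)` yields `"subtitle_audio"`, reflecting that the original English audio is muted once the mode is switched.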

The analysis process performed by the voice analysis unit 20 is described here with reference to the flowchart shown in FIG. 2. The voice analysis unit 20 has a buffer (not shown) and analyzes the audio data that is input from the audio decoder 16 and stored in that buffer. The process of FIG. 2 is performed repeatedly.

When the flowchart shown in FIG. 2 starts, first, in step S1, the voice analysis unit 20 acquires voice features, such as the frequency characteristics of the voice, from the audio data under analysis.

Next, in step S2, the voice analysis unit 20 determines whether the voice features acquired in step S1 match previously acquired voice features.

If the voice features do not match in step S2 (N in step S2), the process proceeds to step S3, where the voice analysis unit 20 determines, based on the audio data under analysis, whether the voice is male or female.

After step S3, in step S4, the voice analysis unit 20 detects the speed and strength of the voice based on the audio data under analysis.

Then, in step S5, the voice analysis unit 20 registers the newly detected male or female person in association with the voice features acquired in step S1. The registration follows the gender determined in step S3. For example, voices determined to be male are registered in order as male A, B, C, and so on, and voices determined to be female are registered in order as female A, B, C, and so on. The previously acquired voice features used for the determination in step S2 are the voice features registered here.

Then, in step S6, the voice analysis unit 20 notifies the synthesized speech generation unit 21 of the person newly registered in step S5 and of the voice speed and strength detected in step S4.

If the voice features match in step S2 (Y in step S2), the process proceeds to step S7, where the voice analysis unit 20 detects the speed and strength of the voice based on the audio data under analysis. Then, in step S8, the voice analysis unit 20 notifies the synthesized speech generation unit 21 of the person corresponding to the voice features that matched those acquired in step S1 (that is, a previously detected person) and of the voice speed and strength detected in step S7.

After step S6 or step S8, the process ends.
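The flow of steps S1 to S8 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the class name `VoiceAnalyzer` is hypothetical, the voice feature is reduced to a hashable key compared by exact equality (a real system would use a similarity measure), and the gender, speed, and strength values are passed in as if steps S3, S4, and S7 had already computed them.

```python
from dataclasses import dataclass, field


@dataclass
class VoiceAnalyzer:
    # Registered voice feature -> person label, e.g. {"f1": "male A"}.
    registry: dict = field(default_factory=dict)
    # How many persons of each gender have been registered so far.
    counts: dict = field(default_factory=lambda: {"male": 0, "female": 0})

    def analyze(self, feature, gender, speed, strength):
        """Run one pass of the FIG. 2 flow and return the notification.

        feature stands in for the S1 result; gender, speed and strength
        stand in for the results of S3 and S4/S7.
        """
        person = self.registry.get(feature)  # S2: compare with registered features
        if person is None:                   # S2: N, so a new person is found
            # S5: register in order of appearance as A, B, C, ...
            letter = chr(ord("A") + self.counts[gender])
            self.counts[gender] += 1
            person = f"{gender} {letter}"
            self.registry[feature] = person
        # S6 (new person) or S8 (known person): notify the synthesizer.
        return {"person": person, "speed": speed, "strength": strength}
```

Calling `analyze` twice with the same feature returns the same person label both times, while a new feature of the same gender receives the next letter.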

The synthesized speech generation unit 21 then generates subtitle audio data, which is synthesized speech, based on the analysis result notified by the voice analysis unit 20 in step S6 or step S8 and the subtitle text data input from the data decoder 15.

More specifically, as shown for example in FIG. 3, the synthesized speech generation unit 21 stores and prepares in advance, in a storage unit (not shown), a plurality of synthesized speech patterns of differing voice quality for each gender, selects the synthesized speech pattern corresponding to the person notified by the voice analysis unit 20, and uses it to generate the subtitle audio data. For example, if the person notified by the voice analysis unit 20 is "male C", the synthesized speech pattern "male voice C" in FIG. 3 is selected.
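The pattern selection amounts to a table lookup keyed by the notified person. The Python sketch below assumes a hypothetical table with three voices per gender and person labels of the form "male C"; the wrap-around used when more persons are detected than patterns exist is an assumption of this sketch, since the description does not cover that case.

```python
# Hypothetical contents of the pattern table of FIG. 3.
PATTERNS = {
    "male": ["male voice A", "male voice B", "male voice C"],
    "female": ["female voice A", "female voice B", "female voice C"],
}


def select_pattern(person: str) -> str:
    """Map a person label such as 'male C' to a synthesized speech pattern."""
    gender, letter = person.split()
    index = ord(letter) - ord("A")
    voices = PATTERNS[gender]
    # Assumption: reuse patterns cyclically if persons outnumber patterns.
    return voices[index % len(voices)]
```

With this table, `select_pattern("male C")` returns `"male voice C"`, matching the example in the text.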

The audio output unit 22 then converts the subtitle audio data input from the synthesized speech generation unit 21 into an audio signal suited to the speaker 23, and the subtitle audio is emitted from the speaker 23.

As described above, in this embodiment the television apparatus 1 (subtitle audio generation apparatus) includes the voice analysis unit 20, which analyzes the speed and strength of the voice (the state of the person's manner of speaking) based on the audio data input from the audio decoder 16, and the synthesized speech generation unit 21, which generates subtitle audio, which is synthesized speech, based on the subtitle text data corresponding to that audio data and the analysis result from the voice analysis unit 20.

As a result, the speed and strength of the voice (the state of the person's manner of speaking) can be reflected in the subtitle audio, so subtitle audio that reduces the sense of discomfort given to the user can be generated.

In particular, reflecting the speed of the voice in the subtitle audio reduces the mismatch between the movement of the person's mouth in the video and the subtitle audio, which reduces the sense of discomfort given to a user watching the video.
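One way to picture how detected speed and strength might be reflected in synthesis is as scale factors on a baseline speaking rate and volume. The Python sketch below is only an assumption: the description gives no units or formulas, so the relative scale (1.0 meaning "typical"), the function name `synthesis_params`, and the clamping ranges are all hypothetical.

```python
def synthesis_params(speed: float, strength: float,
                     base_rate: float = 1.0,
                     base_volume: float = 1.0) -> dict:
    """Scale a baseline rate and volume by the detected speed and strength.

    speed and strength are assumed to be relative values where 1.0 means
    typical; results are clamped so extreme values stay intelligible.
    """
    def clamp(value: float, low: float, high: float) -> float:
        return max(low, min(high, value))

    return {
        "rate": clamp(base_rate * speed, 0.5, 2.0),
        "volume": clamp(base_volume * strength, 0.2, 1.5),
    }
```

A faster speaker then yields a proportionally faster synthesized voice, which is what keeps the subtitle audio closer to the on-screen mouth movement.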

In addition, the voice analysis unit 20 detects a plurality of persons based on the audio data (the process of FIG. 2), and the synthesized speech generation unit 21 generates the subtitle audio by assigning, to each of the detected persons, a synthesized speech pattern from the plurality of synthesized speech patterns stored and prepared in advance in the storage unit (for example, FIG. 3).

As a result, when a plurality of persons appear, subtitle audio that sounds as if a plurality of persons are speaking can be generated, further reducing the sense of discomfort given to the user. In addition, because synthesized speech patterns prepared in advance are assigned, the subtitle audio can be generated quickly.

As in FIG. 3, for example, the plurality of synthesized speech patterns may be synthesized speech patterns of differing voice quality, with a plurality prepared for each gender.

As a result, when a plurality of persons of different genders appear, subtitle audio that sounds as if persons of different genders are speaking can be generated.

＜Second Embodiment＞
Next, a second embodiment of the present invention is described. FIG. 4 is a block diagram showing the schematic configuration of the television apparatus according to the second embodiment of the present invention. The following mainly describes how the television apparatus 1' shown in FIG. 4 differs from the first embodiment (FIG. 1).

The television apparatus 1' shown in FIG. 4 includes a network interface 24. The network interface 24 can connect to the Internet 3 and communicates with a server device 4 via the Internet 3.

As in the first embodiment, the data decoder 15' outputs subtitle text data to the synthesized speech generation unit 21'. In addition, the data decoder 15' decodes the SI (Service Information) separated from the transport stream by the separation unit 13, and outputs the EPG (Electronic Program Guide) information contained in the decoded SI to the synthesized speech generation unit 21'.

The synthesized speech generation unit 21' sends, for example, the detailed program information contained in the input EPG information to the server device 4 via the Internet 3 using the network interface 24.

The server device 4 has a database in which person names are associated with genders. The server device 4 looks up the performers listed in the received detailed program information in this database and determines the number of performers of each gender (for example, 10 men and 8 women). The server device 4 then sends this result to the synthesized speech generation unit 21' via the Internet 3 and the network interface 24.

Based on the received result, the synthesized speech generation unit 21' stores and prepares in advance, in a storage unit (not shown), the corresponding number of synthesized speech patterns for each gender (for example, if the result is 10 men and 8 women, it prepares 10 male patterns and 8 female patterns of the kind shown in FIG. 3).
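The server-side lookup and the pattern preparation can be sketched together in Python. Everything named here is hypothetical: the performer names, the in-memory `GENDER_DB` standing in for the server database, and the generated pattern labels. Performers missing from the database are simply skipped in this sketch, a case the description does not address.

```python
from collections import Counter

# Hypothetical stand-in for the name-to-gender database on the server device 4.
GENDER_DB = {
    "Alice Example": "female",
    "Bob Example": "male",
    "Carol Example": "female",
}


def count_by_gender(performers, db=GENDER_DB) -> Counter:
    """Server side: count the listed performers of each gender."""
    return Counter(db[name] for name in performers if name in db)


def prepare_patterns(counts) -> dict:
    """TV side: prepare exactly as many patterns per gender as were counted."""
    return {
        gender: [f"{gender} voice {i + 1}" for i in range(n)]
        for gender, n in counts.items()
    }
```

Because the number of patterns follows the performer count, a program with two women and one man gets two female patterns and one male pattern, and no more.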

The synthesized speech generation unit 21' then selects one of the prepared synthesized speech patterns according to the person notified by the voice analysis unit 20 and uses it to generate the subtitle audio data.

As described above, according to this embodiment, the synthesized speech generation unit 21' prepares the plurality of synthesized speech patterns in advance based on EPG information (program information). As a result, only as many synthesized speech patterns as the program requires can be prepared, which avoids preparing more than necessary.

＜Third Embodiment＞
Next, a third embodiment of the present invention is described. FIG. 5 is a block diagram showing the schematic configuration of the television apparatus according to the third embodiment of the present invention.

The television apparatus 1'' shown in FIG. 5 differs from the first embodiment (FIG. 1) in the form of audio output.

The television apparatus 1'' shown in FIG. 5 includes an audio output unit 25, an external output terminal 26, an audio output unit 27, and a speaker 28. The plug of headphones 5 can be attached to and detached from the external output terminal 26.

In the first embodiment (FIG. 1), after the switch to the subtitle audio output mode, the audio decoder 16 does not output audio data to the audio output unit 22, and only the subtitle audio is output from the speaker 23. In this embodiment, after the switch to the subtitle audio output mode, the audio decoder 16 outputs audio data to the audio output unit 27 in addition to the voice analysis unit 20.

The synthesized speech generation unit 21 then generates subtitle audio, which is synthesized speech, and outputs it to the audio output unit 25. The audio output unit 25 outputs the audio signal of the subtitle audio, via the external output terminal 26, to the headphones 5 connected to the external output terminal 26. The subtitle audio is therefore emitted from the headphones 5.

At the same time, the audio output unit 27 converts the audio data input from the audio decoder 16 into an audio signal suited to the speaker 28 and outputs it to the speaker 28. In this way, the audio of the broadcast program is emitted from the speaker 28.

このような本実施形態によれば、テレビ装置１’ ’のスピーカ２８から放送番組の音声（例えば英語による音声）が発生すると共に、ヘッドホン５から字幕音声（例えば日本語による音声）が発生する。従って、外国語学習をしたいなどで放送番組の音声で聞きたいユーザはスピーカ２８からの音声を聞き、字幕音声を聞きたいユーザはヘッドホン５から発生する音声を聞くことができる。 According to the present embodiment as described above, sound of a broadcast program (for example, sound in English) is generated from the speaker 28 of the television apparatus 1 ′ ′, and subtitle sound (for example, sound in Japanese) is generated from the headphones 5. Therefore, a user who wants to listen to the broadcast program sound, for example, wants to learn a foreign language, can hear the sound from the speaker 28, and a user who wants to hear the subtitle sound can hear the sound generated from the headphones 5.

ヘッドホン５で字幕音声を聞けば、ヘッドホンを使用しているユーザにとってはスピーカ２８からの音声が聞こえることを抑制すると共に、スピーカ２８からの音声を聞いているユーザにとっては字幕音声が聞こえることを抑制できる。 If the headphone 5 listens to the caption sound, the user who uses the headphones suppresses the sound from the speaker 28 from being heard, and the user who listens to the sound from the speaker 28 suppresses the sound of the caption sound from being heard. it can.

Moreover, even a visually impaired user who does not understand the language of the broadcast audio can enjoy the broadcast by listening to the subtitle audio from the headphones 5.

The Internet-connectable configuration of the second embodiment (FIG. 4) can also be applied to the present embodiment.

<Fourth Embodiment>
Next, a fourth embodiment of the present invention will be described. FIG. 6 is a block diagram showing the schematic configuration of a television apparatus 1''' according to the fourth embodiment of the present invention.

The television apparatus 1''' shown in FIG. 6 differs from the third embodiment (FIG. 5) in that it includes a wireless communication unit 30.

In the present embodiment, in the subtitle audio output mode, the synthesized speech generation unit 21 outputs subtitle audio data to the audio output unit 29 and the audio output unit 27 outputs an audio signal to the speaker 28, as in the third embodiment. The audio output unit 29 passes the input subtitle audio data to the wireless communication unit 30.

The wireless communication unit 30 communicates wirelessly with a mobile device 6 in accordance with a standard such as Bluetooth or Wi-Fi. The mobile device 6 is, for example, a smartphone or a mobile phone.

The wireless communication unit 30 transmits the audio data input from the audio output unit 29 to the mobile device 6 as a radio signal conforming to the supported standard. The built-in speaker of the mobile device 6 then emits the subtitle audio.
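The hand-off from the audio output unit 29 to the wireless communication unit 30 might be sketched as below. The description only states that the audio data is sent as a radio signal conforming to the supported standard; the chunking scheme, the `forward_subtitle_audio` name, and the `send` callable that stands in for the actual Bluetooth or Wi-Fi transport are all assumptions made for illustration.

```python
from typing import Callable, List


def forward_subtitle_audio(pcm: bytes, send: Callable[[bytes], None],
                           chunk_size: int = 1024) -> int:
    """Split subtitle audio data into chunks and hand each one to `send`.

    `send` is a hypothetical stand-in for the wireless communication unit 30;
    a real transport would be supplied by the platform's radio stack.
    Returns the number of chunks forwarded.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    count = 0
    for start in range(0, len(pcm), chunk_size):
        send(pcm[start:start + chunk_size])
        count += 1
    return count


if __name__ == "__main__":
    received: List[bytes] = []          # plays the role of the mobile device 6
    sent = forward_subtitle_audio(b"\x00\x01" * 1500, received.append)
    print(sent, sum(len(c) for c in received))
```

Keeping the transport behind a plain callable means the same forwarding logic would apply whether the receiver is a smartphone or a wireless speaker.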

According to the present embodiment, the speaker 28 of the television apparatus 1''' emits the audio of the broadcast program (for example, English speech) while the mobile device 6 emits the subtitle audio (for example, Japanese speech). A user who wants to hear the program's own audio, for instance to study a foreign language, can listen to the speaker 28, and a user who wants the subtitle audio can listen to the built-in speaker of the mobile device 6 at hand. In particular, the subtitle audio can be heard from the mobile device 6 even in a room other than the one where the television apparatus 1''' is installed.

Moreover, even a visually impaired user who does not understand the language of the broadcast audio can enjoy the broadcast by listening to the subtitle audio from the mobile device 6 at hand. If the user listens through headphones connected to the mobile device 6, less of the audio from the speaker 28 reaches the headphone user, and less of the subtitle audio reaches users who are listening to the speaker 28.

Note that the mobile device 6 may also be something like a Bluetooth-compatible speaker device.

The Internet-connectable configuration of the second embodiment (FIG. 4) can also be applied to the present embodiment.

<Fifth Embodiment>
In the first to fourth embodiments, the synthesized speech generation unit 21 (or 21') may additionally be configured as follows.

When the synthesized speech generation unit 21 detects, from the audio data output by the audio decoder 16, that no voice is present, it judges the passage to be narration and generates synthesized speech from the subtitle text data with a constant voice speed and strength. In this case, a predetermined male or female pattern is used as the synthesized speech pattern.

Alternatively, the synthesized speech generation unit 21 may judge the passage to be narration, as above, when it detects from the video data output by the video decoder 14 that no person appears in the video.

According to the present embodiment, a passage judged to be narration yields monotone subtitle audio, so the user hears subtitle audio that sounds like narration.
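The narration decision described above can be sketched as follows. The description does not say how the absence of voice or of a person is detected, so the RMS energy threshold, the `person_in_video` flag (assumed to come from some separate video analysis), and the names `Prosody`, `NARRATION`, `is_narration`, and `choose_prosody` are all hypothetical choices for this example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Prosody:
    speed: float      # relative speaking rate
    strength: float   # relative loudness
    pattern: str      # name of the synthesized speech pattern


# Fixed speed and strength with a predetermined pattern (values are assumed).
NARRATION = Prosody(speed=1.0, strength=1.0, pattern="predetermined_female")


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of normalized samples in the range [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_narration(samples: Sequence[float],
                 person_in_video: Optional[bool] = None,
                 silence_rms: float = 0.01) -> bool:
    """Judge narration if no person is on screen or no voice is detected."""
    if person_in_video is False:
        return True
    return rms(samples) < silence_rms  # assumed stand-in for "no voice"


def choose_prosody(samples: Sequence[float], analysed: Prosody,
                   person_in_video: Optional[bool] = None) -> Prosody:
    """Use the monotone narration settings, or else the analysed ones."""
    if is_narration(samples, person_in_video):
        return NARRATION
    return analysed
```

For a passage with audible speech and a person on screen, `choose_prosody` returns the analysed speed and strength unchanged; otherwise it falls back to the constant narration settings.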

Embodiments of the present invention have been described above, but they can be modified in various ways within the scope of the gist of the invention.

For example, the present invention is not limited to television apparatuses and may be applied to any device capable of receiving a broadcast signal, such as a hard disk recorder, an optical disc recorder, or a set-top box. In addition, the video, audio, and subtitles need not come from a broadcast signal; they may be based on a playback signal.

DESCRIPTION OF SYMBOLS
1 Television apparatus
2 Antenna
3 Internet
4 Server apparatus
5 Headphones
6 Mobile device
11 Tuner
12 Demodulation unit
13 Separation unit
14 Video decoder
15 Data decoder
16 Audio decoder
17 Video output unit
18 Display unit
19 OSD unit
20 Audio analysis unit
21 Synthesized speech generation unit
22 Audio output unit
23 Speaker
24 Network interface
25 Audio output unit
26 External output terminal
27 Audio output unit
28 Speaker
29 Audio output unit
30 Wireless communication unit

Claims

1. A subtitle audio generation apparatus comprising:
a voice analysis unit that analyzes the manner of speaking of a person based on input audio data; and
a synthesized speech generation unit that generates subtitle audio, which is synthesized speech, based on subtitle data corresponding to the audio data and on the analysis result from the voice analysis unit.

2. The subtitle audio generation apparatus according to claim 1, wherein the manner of speaking of the person is voice speed and/or voice strength.

3. The subtitle audio generation apparatus according to claim 1 or 2, wherein the voice analysis unit detects a plurality of persons based on the audio data, and
the synthesized speech generation unit generates the subtitle audio by assigning, to each of the detected persons, a synthesized speech pattern from a plurality of synthesized speech patterns prepared in advance in a storage unit.

4. The subtitle audio generation apparatus according to claim 3, wherein the synthesized speech generation unit prepares the plurality of synthesized speech patterns in the storage unit in advance based on program information.

5. The subtitle audio generation apparatus according to claim 3, wherein the plurality of synthesized speech patterns are synthesized speech patterns of different voice qualities prepared for each gender.

6. The subtitle audio generation apparatus according to any one of claims 1 to 5, wherein the synthesized speech generation unit generates monotone subtitle audio based on the subtitle data when it detects that no voice is present or that no person appears in the video.

7. The subtitle audio generation apparatus according to any one of claims 1 to 6, wherein sound based on the audio data is emitted from a built-in speaker and sound based on the subtitle audio is output from an external output terminal.

8. The subtitle audio generation apparatus, wherein sound based on the audio data is emitted from a built-in speaker and the subtitle audio is transmitted to an external mobile device by a radio signal.