JP2016009430A - Video classifier learning device and program

Info

Publication number: JP2016009430A
Application number: JP2014131273A
Authority: JP
Inventors: 貴裕奥; Takahiro Oku; 庄衛佐藤; Shoe Sato; 貴裕望月; Takahiro Mochizuki
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2014-06-26
Filing date: 2014-06-26
Publication date: 2016-01-18
Anticipated expiration: 2034-06-26
Also published as: JP6344849B2

Abstract

PROBLEM TO BE SOLVED: To learn, with high accuracy, a discriminator for determining whether or not a video corresponds to a detection target such as an object or a scene.

SOLUTION: A commentary audio extraction unit 21 of a video discriminator learning device 1 compares a first audio signal, which is the audio signal of the program audio, with a second audio signal, which is the audio signal of the program audio with commentary audio added, and extracts a commentary audio signal, which is the audio signal of the commentary audio. A speech recognition unit 24 performs speech recognition on the commentary audio signal extracted by the commentary audio extraction unit 21. A discriminator learning processing unit 3 extracts a word to serve as a label from the speech recognition result of the speech recognition unit 24 and, using the extracted word and a feature amount extracted from the program video data in the video section corresponding to the utterance time of the word, learns a discriminator 34 for detecting, from a feature amount of video data, whether or not the video data relates to the extracted word.

Description

The present invention relates to a video discriminator learning device and a program.

In object recognition technology, which recognizes what objects appear in a video or what kind of scene a video shows, a discriminator used for recognition is learned in advance. During object recognition, the discriminator takes features extracted from the video as input, determines whether or not the object to be detected appears in the video, or whether or not the video is the scene to be detected, and outputs the result as a binary value. Learning the discriminator uses a large amount of training video data labeled with the objects or scenes shown. Methods of labeling training video data include manual labeling (see, for example, Non-Patent Document 1) and the use of broadcast closed captions (see, for example, Non-Patent Document 2).

Non-Patent Document 1: P. Duygulu, K. Barnard, J. F. G. de Freitas, D. A. Forsyth, "Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary", ECCV '02 Proceedings of the 7th European Conference on Computer Vision-Part IV, 2002, pp. 97-112
Non-Patent Document 2: Yoshihiko Kawai, Masato Fujii, "Video Retrieval System Based on Iterative Learning Considering Closed Captions and Image Features", The Institute of Image Information and Television Engineers, Proceedings of the ITE Annual Convention, 2012, 23-7-1 to 23-7-2

As described above, learning a discriminator for recognizing a detection target, such as an object appearing in a video or a video scene, requires training video data labeled with the detection target.

Commentary broadcasting is a service, provided mainly for visually impaired viewers, that explains the video of a broadcast program through commentary audio separate from the program's narration and dialogue. The commentary describes the actions of people, conveys the place and time, and reads out text shown in the video. Subtitle broadcasting, on the other hand, is a service that renders the television audio as text subtitles for people with hearing impairments and for elderly viewers who have difficulty hearing the television audio. The subtitles broadcast by this service are "closed captions".

FIG. 5 shows an example in which the closed captions of a drama program and a transcript of its commentary audio are arranged in chronological order. The start time and end time indicate either the times at which the transcribed content was displayed as a closed caption or the times at which it was spoken in the commentary audio. The caption/commentary entry indicates whether the transcribed content was obtained from the closed captions or from the commentary audio. Comparing the <caption> and <commentary> transcripts, information recognized by ear, such as the characters' lines, is included in <caption> but not in <commentary>. Conversely, information recognized by eye, such as "staff room" and "smile", is included in <commentary> but not in <caption>.

As described above, closed captions often do not contain words that describe the objects appearing in the video or the video scene. Consequently, when broadcast closed captions are used to label video data in order to reduce the burden of manual labeling, it may not be possible to prepare sufficient training video data, and a highly accurate discriminator may not be learned.

The present invention has been made in view of these circumstances, and provides a video discriminator learning device and a program capable of accurately learning a discriminator for determining whether or not a video corresponds to a detection target such as an object or a scene.

One aspect of the present invention is a video discriminator learning device comprising: a commentary audio extraction unit that compares a first audio signal, which is the audio signal of program audio, with a second audio signal, which is the audio signal of the program audio with commentary audio added, and extracts a commentary audio signal, which is the audio signal of the commentary audio; a speech recognition unit that performs speech recognition on the commentary audio signal extracted by the commentary audio extraction unit; and a discriminator learning processing unit that extracts a word to serve as a label from the result of the speech recognition by the speech recognition unit and, using the extracted word and a feature amount extracted from program video data in a video section corresponding to the utterance time of the word, learns a discriminator for detecting, from a feature amount of video data, whether or not the video data relates to the extracted word.
According to this invention, the video discriminator learning device performs speech recognition on the commentary audio signal extracted by comparing the audio signal of the program audio with the audio signal of the program audio with commentary audio added. The device extracts a word to serve as a label from the speech recognition result, and learns the discriminator using the extracted word and a feature amount extracted from the program video data in the video section corresponding to the utterance time of the word.
As a result, the video discriminator learning device can label video data accurately without manual effort and use it for learning the discriminator, and can therefore learn a discriminator with higher accuracy than before.

One aspect of the present invention is the video discriminator learning device described above, wherein the discriminator learning processing unit comprises: a scene division unit that outputs divided video data obtained by dividing the program video data scene by scene; a label assignment unit that extracts a word to serve as a label from the result of the speech recognition by the speech recognition unit and assigns the word as a label to the divided video data at the time corresponding to the utterance time of the extracted word; and a discriminator learning unit that learns the discriminator using the divided video data labeled by the label assignment unit.
According to this invention, the video discriminator learning device assigns a label to each scene of the program video data and learns the discriminator using the program video data labeled scene by scene.
As a result, because the video discriminator learning device assigns labels based on the commentary audio for each scene of the video, it can label the video data accurately even when the timing of a commentary utterance differs from the timing of the video corresponding to its content.

One aspect of the present invention is the video discriminator learning device described above, further comprising an adaptation unit that performs at least one of: a process of adapting the acoustic model used by the speech recognition unit for speech recognition, using the commentary audio signal extracted by the commentary audio extraction unit and the result of the speech recognition by the speech recognition unit; and a process of adapting the language model used by the speech recognition unit for speech recognition, using one or more of the result of the speech recognition by the speech recognition unit, the closed captions of the program, and program information.
According to this invention, the video discriminator learning device performs speech recognition on the commentary audio signal using an acoustic model adapted to the commentator or a language model adapted to the program, and attaches the words obtained from the speech recognition to the video data as labels.
As a result, the video discriminator learning device can recognize the commentary audio accurately and can therefore assign appropriate labels to the video data.

One aspect of the present invention is the video discriminator learning device described above, further comprising a recognition unit that recognizes video data using the discriminator learned by the discriminator learning processing unit.
According to this invention, the video discriminator learning device recognizes video data using the learned discriminator.
As a result, the video discriminator learning device can accurately identify objects appearing in a video and scenes contained in a video.

One aspect of the present invention is the video discriminator learning device described above, wherein the first audio signal and the second audio signal are the main audio signal and the sub audio signal of a broadcast program.
According to this invention, the video discriminator learning device extracts the commentary audio signal from the main audio signal and the sub audio signal of a commentary broadcast program, and attaches the words obtained by speech recognition of the extracted commentary audio signal, as labels, to the video data of the broadcast program.
As a result, the video discriminator learning device can learn the discriminator using broadcast programs.

One aspect of the present invention is a program for causing a computer to function as a video discriminator learning device comprising: commentary audio extraction means that compares a first audio signal, which is the audio signal of program audio, with a second audio signal, which is the audio signal of the program audio with commentary audio added, and extracts a commentary audio signal, which is the audio signal of the commentary audio; speech recognition means that performs speech recognition on the commentary audio signal extracted by the commentary audio extraction means; and discriminator learning processing means that extracts a word to serve as a label from the result of the speech recognition by the speech recognition means and, using the extracted word and a feature amount extracted from program video data in a video section corresponding to the utterance time of the word, learns a discriminator for detecting, from a feature amount of video data, whether or not the video data relates to the extracted word.

According to the present invention, a discriminator for determining whether or not a video corresponds to a detection target such as an object or a scene can be learned accurately.

FIG. 1 is a functional block diagram showing the configuration of the video discriminator learning device according to the embodiment.
FIG. 2 is a diagram showing the processing flow of the discriminator learning process of the video discriminator learning device according to the embodiment.
FIG. 3 is a diagram showing the processing flow of the commentary audio signal extraction process of the commentary audio extraction unit according to the embodiment.
FIG. 4 is a diagram showing a speech recognition result according to the embodiment.
FIG. 5 is a diagram in which closed captions and a transcript of commentary audio are arranged in chronological order.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
The video discriminator learning device of this embodiment extracts commentary audio by comparing the main audio channel and the sub audio channel of a commentary broadcast program. It assigns each word obtained by speech recognition of the extracted commentary audio, as a label, to the program video data in the video section corresponding to the utterance time of that word. Using the labeled program video data, it learns a discriminator for determining whether or not a video corresponds to a detection target such as an object or a scene. When the detection target is an object, the discriminator determines whether or not that object is displayed in the video; when the detection target is a scene, it determines whether or not the series of actions displayed in the video is that scene.

As described above, the video discriminator learning device of this embodiment generates training video data by labeling video data based on the recognition result of commentary audio that describes the objects and scenes in the video. It can therefore label video data more accurately than when closed captions are used, and generate a large amount of training video data, while reducing labor and time costs. By using this large amount of accurately labeled training video data, the device can learn a highly accurate discriminator.

FIG. 1 is a functional block diagram showing the configuration of the video discriminator learning device 1 of this embodiment; only the functional blocks relevant to this embodiment are shown. The video discriminator learning device 1 is implemented by, for example, a computer device, and comprises a commentary speech recognition result extraction unit 2, a discriminator learning processing unit 3, and a recognition processing unit 4.

The commentary speech recognition result extraction unit 2 comprises a commentary audio extraction unit 21, an acoustic model storage unit 22, a language model storage unit 23, a speech recognition unit 24, and a model adaptation unit 25.
The commentary audio extraction unit 21 compares the main audio signal and the sub audio signal of the commentary broadcast program and extracts the commentary audio signal. The main audio signal is the audio signal of the program audio without commentary (first audio signal), and the sub audio signal is the audio signal of the program audio with commentary audio added (second audio signal). In audio with commentary, the program speech and the commentary speech are in most cases uttered without overlapping. The commentary audio signal is the audio signal of the commentary audio.

The acoustic model storage unit 22 stores an acoustic model representing the frequency characteristics of each phoneme. The language model storage unit 23 stores a language model representing how readily words follow one another. The speech recognition unit 24 performs speech recognition on the commentary audio signal extracted by the commentary audio extraction unit 21, using the acoustic model stored in the acoustic model storage unit 22 and the language model stored in the language model storage unit 23. The speech recognition unit 24 outputs speech recognition result data containing the speech recognition result for the commentary audio signal. The speech recognition result includes a transcript of the utterance content in units of morphemes, together with the start time and end time at which each morpheme was uttered.

The model adaptation unit 25 adapts the acoustic model stored in the acoustic model storage unit 22 to the speaker of the commentary audio, and adapts the language model stored in the language model storage unit 23 to the program. The adaptation of the acoustic model and the language model uses the commentary audio signal and the commentary content indicated by the speech recognition result for that signal.

The discriminator learning processing unit 3 comprises a scene division unit 31, a label assignment unit 32, and a discriminator learning unit 33.
The scene division unit 31 divides the video data of the commentary broadcast program (program video data) into similar scenes and outputs divided video data, that is, video data divided scene by scene. The label assignment unit 32 assigns labels to the divided video data output from the scene division unit 31, based on the speech recognition result for the commentary audio indicated by the speech recognition result data output from the speech recognition unit 24. The discriminator learning unit 33 learns the discriminator 34 using, as training video data, the divided video data labeled by the label assignment unit 32. The discriminator 34 detects, based on the video feature amounts obtained from video data, whether or not the video data relates to the word to be detected.
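The label assignment step can be illustrated with a minimal Python sketch. This is not the device's actual procedure, which this passage does not spell out: the `Morpheme` and `Scene` types, the `label_vocabulary` filter, and the rule of matching a word to the scene that contains the midpoint of its utterance are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Morpheme:
    """One recognized morpheme with its utterance times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


@dataclass
class Scene:
    """One divided video segment with the labels assigned so far (hypothetical type)."""
    start: float
    end: float
    labels: set = field(default_factory=set)


def assign_labels(scenes, morphemes, label_vocabulary):
    """Attach each label-worthy word to the scene containing its utterance midpoint.

    Assumption: a word qualifies as a label when it is in label_vocabulary;
    the source does not state the selection rule at this point.
    """
    for m in morphemes:
        if m.text not in label_vocabulary:
            continue
        midpoint = (m.start + m.end) / 2.0
        for scene in scenes:
            if scene.start <= midpoint < scene.end:
                scene.labels.add(m.text)
                break
    return scenes


scenes = [Scene(0.0, 10.0), Scene(10.0, 25.0)]
morphemes = [
    Morpheme("staff room", 2.0, 2.8),
    Morpheme("to", 2.8, 3.0),
    Morpheme("smile", 11.0, 11.5),
]
labeled = assign_labels(scenes, morphemes, {"staff room", "smile"})
```

Labeling whole scenes rather than exact utterance intervals matches the stated motivation that commentary timing and video timing may not coincide exactly.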

The recognition processing unit 4 comprises a scene division unit 41 and a recognition unit 42.
The scene division unit 41 divides the video data of the content to be recognized scene by scene, using the same processing as the scene division unit 31, and outputs divided video data. The content to be recognized may be a commentary broadcast program, a broadcast program other than a commentary broadcast program, or a video other than a broadcast program. This embodiment is described using the case where the content to be recognized is a broadcast program as an example. The recognition unit 42 performs recognition on the divided video data output from the scene division unit 41 using the discriminator 34 learned by the discriminator learning unit 33, and outputs recognition result data containing the recognition result.

Next, the operation of the video discriminator learning device 1 will be described.
FIG. 2 is a diagram showing the processing flow of the discriminator learning process of the video discriminator learning device 1.
First, the main audio signal, the sub audio signal, and the video data of a commentary broadcast program are input to the video discriminator learning device 1 (step S110: YES). These are obtained, for example, from a broadcast signal. In the case of stereo dual broadcasting, L-channel and R-channel audio signals are input for each of the main audio and the sub audio. The commentary speech recognition result extraction unit 2 receives the input main and sub audio signals, and the discriminator learning processing unit 3 receives the input video data.

The commentary audio extraction unit 21 of the commentary speech recognition result extraction unit 2 compares the main audio signal and the sub audio signal of the commentary broadcast and extracts the commentary audio signal (step S120). To extract the commentary audio signal accurately when the main and sub audio signals are out of synchronization, the two signals must be precisely synchronized before they are compared. The commentary audio extraction unit 21 therefore first synchronizes the main audio signal and the sub audio signal. Let the main audio signal and the sub audio signal with start time t and segment length T be $x_t^{t+T} = [x(t) \ldots x(t+T-1)]$ and $y_t^{t+T} = [y(t) \ldots y(t+T-1)]$, respectively. The commentary audio extraction unit 21 calculates the synchronization offset a that maximizes the correlation coefficient r given by equation (1) below. The segment length T is set sufficiently longer than the expected synchronization offset a.

$$ r\left(x_t^{t+T},\, y_{t+a}^{t+a+T}\right) = \frac{S\left(x_t^{t+T},\, y_{t+a}^{t+a+T}\right)}{S\left(x_t^{t+T}\right)\, S\left(y_{t+a}^{t+a+T}\right)} \qquad (1) $$

Here, S(x, y) denotes the covariance of the variables x and y, S(x) the standard deviation of x, and S(y) the standard deviation of y.
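The offset search can be sketched in Python with NumPy as follows. This is only an illustration under stated assumptions: signals are sample arrays, the offset a is searched exhaustively over a hypothetical range `[-max_shift, max_shift]`, and a constant segment is treated as uncorrelated. The function names are invented for the example and are not from the source.

```python
import numpy as np


def correlation_coefficient(x, y):
    """Correlation coefficient S(x, y) / (S(x) * S(y)) of two equal-length segments."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sx, sy = x.std(), y.std()
    if sx == 0.0 or sy == 0.0:
        return 0.0  # assumption: a constant (e.g. silent) segment counts as uncorrelated
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    return float(covariance / (sx * sy))


def estimate_sync_offset(main, sub, t, T, max_shift):
    """Return (a, r): the shift a of the sub signal that maximizes r for the window at t.

    max_shift is a hypothetical search bound; the source only requires that
    T be much longer than the expected offset.
    """
    if t < 0 or t + T > len(main):
        raise ValueError("window [t, t+T) must lie inside the main signal")
    x = main[t:t + T]
    best_a, best_r = 0, -np.inf
    for a in range(-max_shift, max_shift + 1):
        start = t + a
        if start < 0 or start + T > len(sub):
            continue  # shifted window would fall outside the sub signal
        r = correlation_coefficient(x, sub[start:start + T])
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r
```

An exhaustive search is the simplest choice for a sketch; a cross-correlation computed via FFT would find the same maximum more efficiently for long windows.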

For each segment of length T, the commentary audio extraction unit 21 synchronizes the main audio signal and the sub audio signal using the calculated synchronization offset a. From the synchronized main and sub audio signals, it extracts the audio signal of the commentary audio superimposed on the sub audio signal. Methods for extracting the commentary audio signal include, for example, extraction methods A to C below.

(Extraction method A)
The commentary audio extraction unit 21 applies a short-time window to each of the main audio signal and the sub audio signal. The windowed portion is shifted successively along the time axis according to the window size. For each window, the commentary audio extraction unit 21 calculates the difference between the power of the main audio signal and the power of the sub audio signal. If the calculated power difference is smaller than a predetermined threshold, it judges the audio segment of that window to be a segment other than commentary audio. If the calculated power difference is equal to or greater than the threshold, it identifies the audio segment of that window as a commentary audio segment. The commentary audio extraction unit 21 extracts the sub audio signal in the identified commentary audio segments as the commentary audio signal. A start time and an end time are attached to the commentary audio signal. Note that when extraction method A is used, the process of synchronizing the main and sub audio signals is not necessarily required.
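Extraction method A can be sketched as below. The sketch assumes non-overlapping windows, mean squared amplitude as the power measure, and a difference taken as sub power minus main power; the source fixes none of these details, and the function name is hypothetical.

```python
import numpy as np


def find_commentary_windows(main, sub, window, threshold):
    """Return merged (start, end) sample ranges judged to contain commentary.

    A window is flagged when power(sub) - power(main) >= threshold.
    The sign convention and the power measure are assumptions.
    """
    main = np.asarray(main, dtype=float)
    sub = np.asarray(sub, dtype=float)
    n = min(len(main), len(sub))
    segments = []
    for start in range(0, n - window + 1, window):
        end = start + window
        p_main = np.mean(main[start:end] ** 2)
        p_sub = np.mean(sub[start:end] ** 2)
        if p_sub - p_main >= threshold:
            if segments and segments[-1][1] == start:
                # Contiguous with the previous flagged window: extend it.
                segments[-1] = (segments[-1][0], end)
            else:
                segments.append((start, end))
    return segments
```

The returned ranges play the role of the start and end times attached to the commentary audio signal; slicing `sub[start:end]` for each range yields the extracted signal.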

(Extraction method B)
The commentary audio extraction unit 21 regards the sub audio signal as the commentary audio with noise (audio other than the commentary audio) added, and the main audio signal as that noise, and extracts only the commentary audio signal from the sub audio signal by spectral subtraction. In doing so, it obtains from the main audio signal the characteristics of the audio to be removed from the sub audio signal as noise.
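A simplified form of spectral subtraction is sketched below. It assumes already synchronized signals, non-overlapping rectangular frames, and a per-frame noise estimate taken directly from the magnitude spectrum of the main audio; a practical implementation would normally add overlapping windows and smoothing. The function name and the `frame` parameter are assumptions, not part of the source.

```python
import numpy as np


def spectral_subtract(sub, main, frame=512):
    """Per-frame magnitude spectral subtraction of the main signal from the sub signal.

    Keeps the phase of the sub signal and floors negative magnitudes at zero.
    Trailing samples that do not fill a whole frame are dropped.
    """
    sub = np.asarray(sub, dtype=float)
    main = np.asarray(main, dtype=float)
    n = (min(len(sub), len(main)) // frame) * frame
    out = np.zeros(n)
    for start in range(0, n, frame):
        Y = np.fft.rfft(sub[start:start + frame])   # commentary plus program audio
        X = np.fft.rfft(main[start:start + frame])  # program audio, treated as noise
        magnitude = np.maximum(np.abs(Y) - np.abs(X), 0.0)
        spectrum = magnitude * np.exp(1j * np.angle(Y))
        out[start:start + frame] = np.fft.irfft(spectrum, n=frame)
    return out
```

Because only magnitudes are subtracted, this form tolerates small phase differences between the two signals better than sample-wise subtraction, at the cost of some residual distortion.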

(Extraction method C)
In extraction method C, the commentary audio extraction unit 21 subtracts the audio indicated by the main audio signal from the audio indicated by the sub audio signal and extracts the resulting difference signal as the commentary audio signal. However, if the audio levels of the main audio and the sub audio differ in segments where no commentary audio is superimposed, simple subtraction cannot accurately remove the non-commentary audio, which remains as noise in the commentary audio. To remove such noise accurately, the commentary audio extraction unit 21 may calculate the correlation coefficient between the main audio signal and the sub audio signal for each segment of length T and set all values of the commentary audio signal to 0 in segments where the correlation coefficient is equal to or greater than a given threshold.

図３は、抽出方法Ｃによる解説音声抽出部２１の解説音声信号抽出処理の処理フローを示す図である。
まず、解説音声抽出部２１は、ｔに初期値０を設定する（ステップＳ２１０）。解説音声抽出部２１は、主音声信号ｘ_ｔ ^ｔ＋Ｔと副音声信号ｙ_ｔ＋ａ ^{ｔ＋ａ＋Ｔ}の相関係数ｒ_ｔ（ｘ_ｔ ^ｔ＋Ｔ，ｙ_ｔ＋ａ ^{ｔ＋ａ＋Ｔ}）が閾値ｒ_ｔｈｒｅ以上であるか否かを判断する（ステップＳ２２０）。相関係数は、上述した式（１）により算出される。 FIG. 3 is a diagram showing a processing flow of the commentary voice signal extraction process of the commentary voice extraction unit 21 by the extraction method C.
First, the commentary voice extraction unit 21 sets t to the initial value 0 (step S210). The commentary voice extraction unit 21 then determines whether the correlation coefficient $r_t(x_t^{t+T}, y_{t+a}^{t+a+T})$ between the main audio signal $x_t^{t+T}$ and the sub audio signal $y_{t+a}^{t+a+T}$ is equal to or greater than the threshold $r_{thre}$ (step S220). The correlation coefficient is calculated by formula (1) above.

解説音声抽出部２１は、相関係数が閾値以上であると判断した場合（ステップＳ２２０：ＹＥＳ）、開始時刻ｔ、音声区間長Ｔの解説音声信号ｚ_ｔ ^ｔ＋Ｔの値を全て０とする（ステップＳ２３０）。
一方、解説音声抽出部２１は、相関係数が閾値未満であると判断した場合（ステップＳ２２０：ＮＯ）、解説音声信号ｚ_ｔ ^ｔ＋Ｔを、副音声信号ｙ_ｔ＋ａ ^{ｔ＋ａ＋Ｔ}から主音声信号ｘ_ｔ ^ｔ＋Ｔを減算した値とする（ステップＳ２４０）。 When the commentary voice extraction unit 21 determines that the correlation coefficient is equal to or greater than the threshold (step S220: YES), it sets all values of the commentary voice signal $z_t^{t+T}$, which has start time t and voice section length T, to 0 (step S230).
On the other hand, when the commentary voice extraction unit 21 determines that the correlation coefficient is less than the threshold (step S220: NO), it sets the commentary voice signal $z_t^{t+T}$ to the value obtained by subtracting the main voice signal $x_t^{t+T}$ from the sub voice signal $y_{t+a}^{t+a+T}$ (step S240).

ステップＳ２３０またはステップＳ２４０の処理の後、解説音声抽出部２１は、全音声信号についてステップＳ２２０〜ステップＳ２４０の処理を終了したか否かを判断する（ステップＳ２５０）。解説音声抽出部２１は、まだ終了していないと判断した場合（ステップＳ２５０：ＮＯ）、ｔの値にＴを加算して更新した後（ステップＳ２６０）、ステップＳ２２０からの処理を繰り返す。そして、解説音声抽出部２１は、全音声信号についてステップＳ２２０〜ステップＳ２４０の処理を終了したと判断した場合（ステップＳ２５０：ＹＥＳ）、処理を終了する。 After the process of step S230 or step S240, the commentary voice extraction unit 21 determines whether or not the processes of step S220 to step S240 have been completed for all voice signals (step S250). If the commentary voice extraction unit 21 determines that it has not been completed yet (step S250: NO), it adds T to the value of t and updates it (step S260), and then repeats the processing from step S220. If the commentary voice extraction unit 21 determines that the process of steps S220 to S240 has been completed for all voice signals (step S250: YES), the commentary voice extraction unit 21 ends the process.
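The flow of steps S210 to S260 can be sketched in Python as follows. This is an illustrative sketch under assumptions: formula (1) is not reproduced in this part of the text, so the Pearson correlation coefficient is assumed; the signals are plain lists of samples; and the names correlation, extract_commentary, seg_len (T), r_thre and offset (a) are hypothetical.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient (assumed form of formula (1))."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return sxy / denom if denom > 0 else 0.0   # constant segment: treat as uncorrelated

def extract_commentary(main, sub, seg_len, r_thre=0.9, offset=0):
    """Correlation-gated subtraction of the main audio from the sub audio."""
    n = min(len(main), len(sub) - offset)
    z = []
    t = 0                                          # step S210
    while t < n:                                   # step S250: stop when all samples are done
        x = main[t:min(t + seg_len, n)]
        y = sub[t + offset:t + offset + len(x)]
        if correlation(x, y) >= r_thre:            # step S220
            z.extend([0.0] * len(x))               # step S230: no commentary in this section
        else:
            z.extend(b - a for a, b in zip(x, y))  # step S240: sub minus main
        t += seg_len                               # step S260
    return z
```

Because the correlation coefficient is insensitive to a constant gain, a section in which the sub voice is merely a quieter copy of the main voice is still zeroed, which is the residual-noise case that simple subtraction handles poorly.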

解説音声抽出部２１は、上記のいずれかの抽出方法により抽出した解説音声信号を音声認識部２４に出力する。 The commentary speech extraction unit 21 outputs the commentary speech signal extracted by any of the above extraction methods to the speech recognition unit 24.

図２において、音声認識部２４は、解説音声抽出部２１が抽出した解説音声信号を、音響モデル記憶部２２に記憶されている音響モデル、及び、言語モデル記憶部２３に記憶されている言語モデルを用いて従来技術と同様に音声認識する（ステップＳ１３０）。 In FIG. 2, the speech recognition unit 24 recognizes the commentary speech signal extracted by the commentary speech extraction unit 21 in the same manner as in the prior art, using the acoustic model stored in the acoustic model storage unit 22 and the language model stored in the language model storage unit 23 (step S130).

図４は、音声認識結果の例を示す図である。同図に示すように、解説音声信号の音声認識結果は、解説音声の発話内容に含まれる単語と、それら各単語の開始時刻及び終了時刻を含む。解説音声の発話内容に含まれる単語は、形態素に相当する。例えば、開始時刻「０３：４６．５２」から終了時刻「０３：４７．０１」までは「笑顔」と発話され、開始時刻「０３：４７．０２」から終了時刻「０３：４７．０４」までは「で」と発話されたことを示す。
音声認識部２４は、音声認識結果を設定した音声認識結果データを出力する。 FIG. 4 is a diagram illustrating an example of a speech recognition result. As shown in the figure, the speech recognition result of the commentary speech signal includes the words contained in the utterance content of the commentary speech and the start time and end time of each word. Each word contained in the utterance content corresponds to a morpheme. For example, the result indicates that “smile” was uttered from the start time “03:46.52” to the end time “03:47.01”, and that “de” was uttered from the start time “03:47.02” to the end time “03:47.04”.
The voice recognition unit 24 outputs voice recognition result data in which a voice recognition result is set.

モデル適応化部２５は、解説音声信号と、音声認識結果データが示すその解説音声信号の音声認識結果を用いて、音響モデル及び言語モデルを従来技術により適応化（教師なし適応化）する（ステップＳ１４０）。例えば、音響モデルの適応化の手法には、ＭＬＬＲやＭＡＰ推定がある。また、言語モデルの適応化の手法には、線形補間法がある。 The model adaptation unit 25 adapts the acoustic model and the language model by conventional techniques (unsupervised adaptation), using the commentary speech signal and the speech recognition result of that commentary speech signal indicated by the speech recognition result data (step S140). For example, acoustic model adaptation methods include MLLR and MAP estimation, and language model adaptation methods include linear interpolation.

なお、ＭＬＬＲは、例えば、「C. J. Leggetter and P. C. Woodland、“Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models”、Computer Speech and Language、１９９５年、Volume9、ｐ．１７１−１８５」（文献１）に記載されている。
また、ＭＡＰ推定は、例えば、「J. Gauvain and Chin-Hui Lee、“Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”、Speech and Audio Processing、IEEE Transactionson、１９９４年、Volume2、Issue2、ｐ．２９１−２９８」（文献２）に記載されている。
また、線形補間法は、例えば、「北研二、“確率的言語モデル”、東京大学出版会、１９９９年、ｐ．６３−６６」（文献３）に記載されている。 MLLR is described in, for example, “C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models”, Computer Speech and Language, 1995, Volume 9, p. 171-185” (Reference 1).
MAP estimation is described in, for example, “J. Gauvain and Chin-Hui Lee, “Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains”, IEEE Transactions on Speech and Audio Processing, 1994, Volume 2, Issue 2, p. 291-298” (Reference 2).
The linear interpolation method is described in, for example, “Kita Kenji, “Probabilistic Language Model”, University of Tokyo Press, 1999, p. 63-66” (Reference 3).

同じ解説放送番組であれば、同一人物の解説音声が使用されることが多い。また、各ドラマでは放送番組によって使用されやすい単語も異なる。そこで、モデル適応化部２５は、解説放送番組に応じて音響モデル及び言語モデルを適応化し、適応化した音響モデル及び言語モデルをその解説放送番組の番組ＩＤと対応づけてそれぞれ音響モデル記憶部２２、言語モデル記憶部２３に登録する。モデル適応化部２５は、解説音声の音声認識結果に加えて、あるいは、解説音声の音声認識結果に代えて、解説放送番組のクローズドキャプションや、電子番組ガイドから取得したその解説放送番組の番組情報を言語モデルの適応化に利用することもできる。クローズドキャプションや電子番組ガイドは、例えば、放送信号から得ることができる。これらの情報を適応化に利用することで、解説音声で発話されるドラマの登場人物や場所の名前などの固有名詞を効率的に学習することができる。音声認識部２４は、適応化後の繰り返し処理で再びステップＳ１３０において音声認識を行う際、主音声信号及び副音声信号が得られた解説放送番組の番組ＩＤに対応付けられた音響モデル及び言語モデルを用いる。主音声信号及び副音声信号が得られた解説放送番組と同じジャンルの解説放送番組など、類似あるいは関連した解説放送番組の番組ＩＤに対応付けられた音響モデル及び言語モデルを用いてもよい。なお、モデル適応化部２５は、音響モデルの適応化と言語モデルの適応化の一方のみを行ってもよい。 Within the same commentary broadcast program, the commentary is often voiced by the same person. The words likely to be used in a drama also differ from program to program. The model adaptation unit 25 therefore adapts the acoustic model and the language model to each commentary broadcast program, and registers the adapted acoustic model and language model in the acoustic model storage unit 22 and the language model storage unit 23, respectively, in association with the program ID of that commentary broadcast program. In addition to, or instead of, the speech recognition result of the commentary speech, the model adaptation unit 25 can also use closed captions of the commentary broadcast program, or program information for that program obtained from an electronic program guide, to adapt the language model. Closed captions and electronic program guides can be obtained from the broadcast signal, for example. Using this information for adaptation makes it possible to efficiently learn proper nouns uttered in the commentary speech, such as the names of characters and places in a drama.
When the speech recognition unit 24 performs speech recognition again in step S130 in the iteration after adaptation, it uses the acoustic model and the language model associated with the program ID of the commentary broadcast program from which the main audio signal and the sub audio signal were obtained. It may instead use an acoustic model and a language model associated with the program ID of a similar or related commentary broadcast program, such as a commentary broadcast program of the same genre as the one from which the main audio signal and the sub audio signal were obtained. Note that the model adaptation unit 25 may perform only one of the adaptation of the acoustic model and the adaptation of the language model.
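As a rough illustration of language model adaptation by linear interpolation, the following Python sketch mixes a general model with a program-specific model. The text only names the technique, so everything concrete here is an assumption: the models are reduced to unigram probability tables, and the function name interpolate_unigram, the weight lam and the sample probabilities are hypothetical.

```python
def interpolate_unigram(base, adapted, lam):
    """Return lam * adapted + (1 - lam) * base over the union vocabulary.

    base    -- general-purpose unigram probabilities (word -> probability)
    adapted -- probabilities estimated from one program's commentary transcripts,
               closed captions or program information (assumed to be available)
    lam     -- interpolation weight in [0, 1] given to the adapted model
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocabulary = set(base) | set(adapted)
    return {w: lam * adapted.get(w, 0.0) + (1.0 - lam) * base.get(w, 0.0)
            for w in vocabulary}

# Hypothetical numbers: the adapted model knows a proper noun the base model lacks.
base = {"corridor": 0.5, "go": 0.5}
adapted = {"corridor": 0.5, "seri": 0.5}
model = interpolate_unigram(base, adapted, lam=0.4)
```

If both inputs are proper probability distributions, the interpolated table is one as well, and a word that appears only in the program-specific data receives a non-zero probability.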

識別器学習処理部３のシーン分割部３１は、既存の技術を用いて解説放送番組の映像データを類似したシーン毎に分割する（ステップＳ１４０）。このシーン分割には、画像特徴量を用いて同じ場面のショット（カットともいう）をまとめてシーンと判断する既存技術を用いることができる。 The scene dividing unit 31 of the discriminator learning processing unit 3 divides the video data of the commentary broadcast program into similar scenes using existing technology (step S140). For this scene division, it is possible to use an existing technique in which shots (also referred to as cuts) of the same scene are collectively determined as a scene using image feature amounts.

具体的には、シーン分割部３１が備える図示しない記憶部（映像識別器学習装置１に接続される外部の記憶装置でもよい。）に、予め、複数の映像から生成した画像片ワードを記憶しておく。画像片ワードを生成するためには、複数の映像からサンプリングしたフレーム画像を所定の画像サイズに区切ってブロック画像とし、各ブロック画像の特徴量を表す特徴ベクトルに基づいてブロック画像集合をクラスタリングする。画像片ワードは、各クラスタの中心ベクトルを要素するベクトルとして得られる。シーン分割部３１は、入力された解説放送番組の映像データを、既存の任意のショット検出技術を用いてショット毎に分割し、ショット系列を生成する。シーン分割部３１は、解説放送番組の映像データの各ショットから所定間隔毎のフレーム画像をサンプリングすると、サンプリングしたフレーム画像を所定の画像サイズに区切ったブロック画像から画像の特徴量を表す特徴ベクトルを取得する。シーン分割部３１は、ショット毎に特徴ベクトルと画像片ワードとの類似性に基づいて、画像片ワードのヒストグラムを算出する。このヒストグラムにより、各ショットにどの種類のブロック画像がどのくらい存在するかの出現比率が得られる。シーン分割部３１は、各ショットのヒストグラムの変化量に基づいて、ショットを統合してシーンに分割する。シーン分割部３１は、映像データをシーン毎に分割して分割映像データを生成し、ラベル付与部３２に出力する。 Specifically, image fragment words generated from a plurality of videos are stored in advance in a storage unit (not shown) of the scene division unit 31 (this may be an external storage device connected to the video classifier learning device 1). To generate the image fragment words, frame images sampled from a plurality of videos are divided into block images of a predetermined image size, and the set of block images is clustered based on the feature vector representing the feature amount of each block image. The image fragment words are obtained as a vector whose elements are the center vectors of the clusters. The scene division unit 31 divides the input video data of the commentary broadcast program into shots using any existing shot detection technique, and generates a shot sequence. The scene division unit 31 samples frame images at predetermined intervals from each shot of the video data of the commentary broadcast program, divides each sampled frame image into block images of the predetermined image size, and obtains from the block images feature vectors representing their feature amounts. For each shot, the scene division unit 31 calculates a histogram of the image fragment words based on the similarity between the feature vectors and the image fragment words.
This histogram gives the appearance ratio of the block images in each shot, that is, which types of block image appear in the shot and in what proportion. The scene division unit 31 merges shots into scenes based on the amount of change in the histogram from shot to shot. The scene division unit 31 divides the video data into scenes to generate divided video data, and outputs the divided video data to the label assigning unit 32.
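The histogram-based grouping of shots into scenes can be illustrated with the following Python sketch. It is a simplified sketch under assumptions, not the existing technique the text refers to: each block feature vector is assigned to its nearest cluster center (one possible reading of "similarity"), the change between consecutive shots is measured as an L1 distance, and the names nearest_word, shot_histogram, merge_shots and the threshold max_change are hypothetical.

```python
def nearest_word(vec, codebook):
    """Index of the cluster center closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, codebook[i])))

def shot_histogram(block_vectors, codebook):
    """Normalized histogram of image fragment words for the blocks of one shot."""
    counts = [0] * len(codebook)
    for vec in block_vectors:
        counts[nearest_word(vec, codebook)] += 1
    total = sum(counts) or 1          # avoid division by zero for an empty shot
    return [c / total for c in counts]

def merge_shots(histograms, max_change=0.5):
    """Group consecutive shots whose histogram change stays within max_change."""
    scenes = []
    for i, hist in enumerate(histograms):
        if scenes and sum(abs(a - b)
                          for a, b in zip(histograms[i - 1], hist)) <= max_change:
            scenes[-1].append(i)      # small change: same scene as the previous shot
        else:
            scenes.append([i])        # large change (or first shot): start a new scene
    return scenes
```

Each returned scene is a list of shot indices; the start and end times of a scene would follow from the first and last shot in the list.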

ラベル付与部３２は、シーン分割部３１から出力された各シーンの分割映像データに、音声認識部２４が出力した音声認識結果データにより示される解説音声の音声認識結果に基づいてラベルを付与する（ステップＳ１５０）。例えば、ラベル付与部３２は、まだラベルを付加していない分割映像データを一つ選択し、音声認識結果データが示す認識結果の単語の中から、選択した分割映像データの開始時刻及び終了時刻の範囲内に開始時刻または終了時刻が含まれる単語を特定する。ラベル付与部３２は、特定した単語から助詞や助動詞などの所定の品詞の単語を除外してラベルとなる単語を取得する。ラベル付与部３２は、取得した単語を示すラベルデータを選択中の分割映像データに付加する。ラベル付与部３２は、まだラベルを付加していない分割映像データを選択し、同様の処理を繰り返す。これにより、例えば、ラベル付与部３２は、開始時刻「０３：４５．００」、終了時刻「０３：５５．００」のシーンの分割映像データに、図４に示す解説音声の認識結果の単語のうち、「笑顔」、「廊下」、「行く」、「セリ」、「職員室」、「前」を示すラベルデータを付与する。 The label assigning unit 32 assigns labels to the divided video data of each scene output from the scene dividing unit 31, based on the speech recognition result of the commentary speech indicated by the speech recognition result data output from the speech recognition unit 24 (step S150). For example, the label assigning unit 32 selects one piece of divided video data that has not yet been labeled and, from among the recognized words indicated by the speech recognition result data, identifies the words whose start time or end time falls within the range from the start time to the end time of the selected divided video data. The label assigning unit 32 excludes words of predetermined parts of speech, such as particles and auxiliary verbs, from the identified words to obtain the words that serve as labels. The label assigning unit 32 adds label data indicating the obtained words to the selected divided video data. The label assigning unit 32 then selects another piece of divided video data that has not yet been labeled and repeats the same processing. As a result, for example, the label assigning unit 32 assigns, to the divided video data of the scene with start time “03:45.00” and end time “03:55.00”, label data indicating “smile”, “corridor”, “go”, “seri”, “staff room”, and “front”, which are among the words of the commentary speech recognition result shown in FIG. 4.
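The labeling rule of step S150 can be sketched as follows. This is an illustrative sketch under assumptions: the time stamps are read as minutes:seconds, part-of-speech tags are assumed to come with the recognition result, duplicate words are collapsed into one label, and the names Word, to_seconds, labels_for_scene and EXCLUDED_POS are hypothetical. The timing of the third sample word is invented for the illustration.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    pos: str       # part of speech, e.g. "noun" or "particle" (assumed to be available)
    start: float   # seconds from the start of the program
    end: float

EXCLUDED_POS = {"particle", "auxiliary verb"}

def to_seconds(stamp):
    """Convert a stamp such as "03:46.52" (assumed minutes:seconds) to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)

def labels_for_scene(words, scene_start, scene_end):
    """Words whose start or end time falls inside the scene, minus excluded parts of speech."""
    labels = []
    for w in words:
        inside = (scene_start <= w.start <= scene_end
                  or scene_start <= w.end <= scene_end)
        if inside and w.pos not in EXCLUDED_POS and w.text not in labels:
            labels.append(w.text)   # assumption: each word is kept once per scene
    return labels

words = [
    Word("smile", "noun", to_seconds("03:46.52"), to_seconds("03:47.01")),
    Word("de", "particle", to_seconds("03:47.02"), to_seconds("03:47.04")),
    Word("station", "noun", to_seconds("03:58.00"), to_seconds("03:58.50")),  # hypothetical
]
scene_labels = labels_for_scene(words, to_seconds("03:45.00"), to_seconds("03:55.00"))
```

For this input the particle "de" is dropped by the part-of-speech filter and the hypothetical word after 03:55.00 falls outside the scene.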

映像識別器学習装置１は、解説放送番組の入力がまだある場合（ステップＳ１１０：ＹＥＳ）、ステップＳ１２０〜ステップＳ１６０の処理を繰り返す。解説放送番組の入力がない場合（ステップＳ１１０：ＮＯ）、識別器学習部３３は、各検出対象に対応する識別器３４を、その検出対象の単語がラベルとして付与された分割映像データを用いて学習する（ステップＳ１７０）。そこで、識別器学習部３３は、各分割映像データに付加されたラベルデータを参照して、識別器３４を学習する対象のラベルを抽出する。あるいは、識別器学習部３３は、ユーザから識別器３４を学習する対象のラベルの入力を受けてもよい。識別器学習部３３は、識別器３４を学習する対象のラベルが付与された分割映像データを選択する。例えば、識別器学習部３３は、識別器３４を学習する対象のラベル「笑顔」を含んだラベルデータが付与された分割映像データを全て選択する。識別器学習部３３は、選択した分割映像データから取得した映像の特徴量を用いて、「笑顔」に対応する映像の識別器３４を機械学習により学習する。機械学習には、例えば、サポートベクターマシンやランダムフォレストなどを用いることができるが、他の教師あり学習の手法を用いてもよい。また、映像の特徴量には、任意の１以上の種類の特徴量を用いることができる。例えば、ＳＩＦＴ（Scale-Invariant Feature Transform）特徴量、ＰＣＡ（Principal Component Analysis）−ＳＩＦＴ特徴量、Ｈａａｒ−ｌｉｋｅ特徴量、ＨＯＧ（Histograms of Oriented Gradients）特徴量、ＬＢＰ（Local Binary Pattern）特徴量などを用いることができるが、他の特徴量を用いてもよい。 If there is still commentary broadcast program input (step S110: YES), the video discriminator learning device 1 repeats the processing of steps S120 to S160. When there is no more commentary broadcast program input (step S110: NO), the discriminator learning unit 33 learns the discriminator 34 corresponding to each detection target, using the divided video data to which the word for that detection target has been assigned as a label (step S170). To do so, the discriminator learning unit 33 refers to the label data added to each piece of divided video data and extracts the labels for which a discriminator 34 is to be learned. Alternatively, the discriminator learning unit 33 may receive from the user the labels for which a discriminator 34 is to be learned. The discriminator learning unit 33 selects the divided video data to which a label targeted for learning is assigned. For example, the discriminator learning unit 33 selects all the divided video data whose label data includes the label “smile”, for which a discriminator 34 is to be learned.
The discriminator learning unit 33 learns, by machine learning, the video discriminator 34 corresponding to “smile” using the video feature amounts obtained from the selected divided video data. A support vector machine or a random forest, for example, can be used for the machine learning, but other supervised learning methods may be used. Any one or more kinds of feature amount can be used as the video feature amounts. For example, SIFT (Scale-Invariant Feature Transform) features, PCA (Principal Component Analysis)-SIFT features, Haar-like features, HOG (Histograms of Oriented Gradients) features, and LBP (Local Binary Pattern) features can be used, but other feature amounts may also be used.
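As one possible illustration of step S170, the following Python sketch trains a binary detector for a single label with a linear support vector machine fitted by subgradient descent on the hinge loss. It is a sketch under assumptions rather than the learning procedure of the embodiment: segments without the label are treated as negative examples (the text does not state how negatives are chosen), the two-dimensional feature vectors stand in for real video features such as SIFT or HOG, and the names train_linear_svm, detect and all hyperparameters are hypothetical.

```python
def train_linear_svm(samples, labels, epochs=20, lr=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the L2-regularized hinge loss.

    samples -- feature vectors, one per divided video segment
    labels  -- +1 if the segment carries the target label (e.g. "smile"), else -1
    """
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Sample violates the margin: step towards classifying it correctly.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Margin satisfied: only the regularizer shrinks the weights.
                w = [wi * (1.0 - lr * reg) for wi in w]
    return w, b

def detect(w, b, x):
    """Binary decision: does feature vector x correspond to the learned label?"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0.0

# Hypothetical features: two segments labeled "smile", two without the label.
features = [(2.0, 2.0), (3.0, 1.0), (-2.0, -2.0), (-1.0, -3.0)]
has_label = [1, 1, -1, -1]
w, b = train_linear_svm(features, has_label)
```

One such detector would be trained per label, which matches the binary yes/no output the discriminator is described as producing.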

なお、上述した識別器学習処理において、映像識別器学習装置１は、ステップＳ１２０〜ステップＳ１４０の処理と、ステップＳ１５０の処理とを並行して行ってもよく、ステップＳ１５０の処理の後にステップＳ１２０〜ステップＳ１４０の処理を行ってもよい。また、映像識別器学習装置１は、ステップＳ１４０の処理をステップＳ１５０またはステップＳ１６０の処理の後に、あるいは、ステップＳ１５０またはステップＳ１６０の処理と並行して実行してもよく、ステップＳ１４０の処理を行わなくともよい。 In the classifier learning process described above, the video classifier learning device 1 may perform the processes of steps S120 to S140 in parallel with the process of step S150, or may perform the processes of steps S120 to S140 after the process of step S150. The video classifier learning device 1 may also execute the process of step S140 after, or in parallel with, the process of step S150 or step S160, or may omit the process of step S140.

映像識別器学習装置１の認識処理部４は、上記の識別器学習処理により学習された識別器３４を用いて放送番組の映像を認識し、映像に含まれるシーン毎の認識結果を出力する。 The recognition processing unit 4 of the video discriminator learning device 1 recognizes the video of the broadcast program using the discriminator 34 learned by the discriminator learning process, and outputs a recognition result for each scene included in the video.

映像にどのようなシーンが含まれているか、また、どのような物体が表示されているかの認識を行う場合、映像識別器学習装置１は、以下のように動作する。
まず、認識対象の放送番組の映像データが映像識別器学習装置１の認識処理部４に入力される。シーン分割部４１は、シーン分割部３１と同様の処理により、認識対象の放送番組の映像データをシーン毎に分割し、分割映像データを出力する。
認識部４２は、全ての識別器３４それぞれを用いて、各分割映像データの認識を行う。例えば、「笑顔」に対応した識別器３４を用いて分割映像データを認識することによって、分割映像データが「笑顔」に対応しているか否かの判定結果が得られる。「笑顔」に対応しているか否かとは、分割映像データに認識対象である「笑顔」のシーンが含まれるか否かを意味する。また、「廊下」に対応した識別器３４を用いて分割映像データを認識することによって、分割映像データが「廊下」に対応しているか否かの判定結果が得られる。「廊下」に対応しているか否かとは、映像中に認識対象である「廊下」が表示されているか否かを意味する。認識部４２は、各分割映像データの認識結果を設定した認識結果データを出力する。例えば、認識結果データには、各分割映像データ（シーン）の開始時刻及び終了時刻と、各分割映像データが対応していると判断された認識対象とが含まれる。 When recognizing what scene is included in a video and what object is displayed, the video discriminator learning device 1 operates as follows.
First, video data of a broadcast program to be recognized is input to the recognition processing unit 4 of the video classifier learning device 1. The scene dividing unit 41 divides the video data of the broadcast program to be recognized for each scene by the same processing as the scene dividing unit 31, and outputs the divided video data.
The recognition unit 42 performs recognition on each piece of divided video data using every one of the classifiers 34. For example, recognizing divided video data with the discriminator 34 corresponding to “smile” yields a determination of whether or not the divided video data corresponds to “smile”, that is, whether or not the divided video data contains a “smile” scene, which is the recognition target. Likewise, recognizing divided video data with the discriminator 34 corresponding to “corridor” yields a determination of whether or not the divided video data corresponds to “corridor”, that is, whether or not a “corridor”, which is the recognition target, is displayed in the video. The recognition unit 42 outputs recognition result data in which the recognition result of each piece of divided video data is set. For example, the recognition result data includes the start time and end time of each piece of divided video data (scene) and the recognition targets to which each piece of divided video data was determined to correspond.

また、キーワードに対応した放送番組の映像を検索する場合、映像識別器学習装置１は、以下のように動作する。
認識処理部４は、各放送番組の映像データに対して上記のように全ての識別器３４を用いた認識処理を行い、放送番組の識別情報と認識結果とを対応付けて内部に備える図示しない記憶部（映像識別器学習装置１に接続される外部の記憶装置でもよい。）に記憶しておく。認識処理部４にキーワードが入力された場合、認識部４２は、記憶部に記憶されている各放送番組の認識結果を、入力されたキーワードにより検索する。なお、キーワードに加えて検索対象の放送番組の情報が入力された場合、認識部４２は、検索対象の放送番組の認識結果を、入力されたキーワードにより検索する。認識部４２は、キーワードに対応するとして特定された分割映像データの開始時刻及び終了時刻と、特定された分割映像データが認識結果に含まれる放送番組の識別情報を取得する。認識部４２は、映像識別器学習装置１の内部または外部に備えるデータベース等から放送番組の識別情報に対応した番組情報を読み出し、読み出した番組情報と分割映像データの情報を認識結果データとして出力する。番組情報は、例えば、放送番組の番組ＩＤや放送番組のタイトル、放送番組の説明など任意とすることができる。また、分割映像データの情報は、分割映像データの開始時刻及び終了時刻でもよく、その開始時刻及び終了時刻の放送番組の映像データやその映像データから抽出した静止画でもよい。 When searching for a broadcast program video corresponding to the keyword, the video discriminator learning device 1 operates as follows.
The recognition processing unit 4 performs the recognition process using all the discriminators 34 on the video data of each broadcast program as described above, and stores the identification information of each broadcast program in association with its recognition result in an internal storage unit (not shown; an external storage device connected to the video classifier learning device 1 may be used instead). When a keyword is input to the recognition processing unit 4, the recognition unit 42 searches the recognition results of the broadcast programs stored in the storage unit for the input keyword. When information on the broadcast program to be searched is input in addition to the keyword, the recognition unit 42 searches the recognition result of that broadcast program for the input keyword. The recognition unit 42 obtains the start time and end time of the divided video data identified as corresponding to the keyword, and the identification information of the broadcast program whose recognition result contains the identified divided video data. The recognition unit 42 reads the program information corresponding to the broadcast program identification information from a database or the like provided inside or outside the video discriminator learning device 1, and outputs the read program information and the information on the divided video data as recognition result data. The program information can be anything, for example the program ID of the broadcast program, its title, or its description. The information on the divided video data may be the start time and end time of the divided video data, or may be the video data of the broadcast program corresponding to that start time and end time or a still image extracted from that video data.
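The keyword search over stored recognition results can be sketched as a simple lookup. This is an illustrative sketch under assumptions: the stored results are modeled as an in-memory dictionary from program identifier to a list of (start time, end time, recognized targets) entries, and the name search, the program identifiers and the sample entries are hypothetical.

```python
def search(index, keyword, program_id=None):
    """Return (program_id, start, end) for every stored scene recognized as `keyword`.

    index      -- program identifier -> list of (start, end, set of recognized targets)
    program_id -- optional: restrict the search to one broadcast program
    """
    hits = []
    for pid, scenes in index.items():
        if program_id is not None and pid != program_id:
            continue
        for start, end, targets in scenes:
            if keyword in targets:
                hits.append((pid, start, end))
    return hits

# Hypothetical stored recognition results for two programs (times in seconds).
index = {
    "prog-001": [(225.0, 235.0, {"smile", "corridor"}), (235.0, 250.0, {"staff room"})],
    "prog-002": [(0.0, 12.0, {"smile"})],
}
all_hits = search(index, "smile")
one_program = search(index, "smile", program_id="prog-002")
```

The program identifier in each hit would then be used to look up the program information (title, description and so on) that is output together with the scene times.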

また、映像がキーワードに対応しているか否かを検索する場合、映像識別器学習装置１は、以下のように動作する。
まず、認識対象の放送番組の映像データとキーワードが映像識別器学習装置１の認識処理部４に入力される。シーン分割部４１は、シーン分割部３１と同様の処理により、認識対象の放送番組の映像データをシーン毎に分割し、分割映像データを出力する。
認識部４２は、キーワードに対応した識別器３４を用いて各分割映像データの認識を行う。認識部４２は、キーワードに対応した識別器３４を用いた認識により、キーワードに対応していると判断された分割映像データがある場合、その分割映像データの開始時刻及び終了時刻を出力する。また、認識部４２は、キーワードに対応していると判断された分割映像データがない場合、キーワードに対応しない旨を出力する。 When searching for whether or not a video corresponds to a keyword, the video discriminator learning device 1 operates as follows.
First, video data and keywords of a broadcast program to be recognized are input to the recognition processing unit 4 of the video classifier learning device 1. The scene dividing unit 41 divides the video data of the broadcast program to be recognized for each scene by the same processing as the scene dividing unit 31, and outputs the divided video data.
The recognition unit 42 recognizes each divided video data using the classifier 34 corresponding to the keyword. The recognition unit 42 outputs the start time and end time of the divided video data when there is divided video data determined to correspond to the keyword by recognition using the discriminator 34 corresponding to the keyword. Further, when there is no divided video data determined to correspond to the keyword, the recognizing unit 42 outputs that it does not correspond to the keyword.

なお、上記実施形態において、シーン分割部３１及びシーン分割部４１は、映像データをシーン毎に分割しているが、ショット毎、あるいは、所定の時間毎に映像データを分割し、分割映像データとして出力してもよい。
また、映像識別器学習装置１を、シーン分割部３１を備えずに構成することもできる。この場合、ラベル付与部３２は、音声認識結果データが示す認識結果の単語の中から、助詞や助動詞などの所定の品詞の単語を除外してラベルとなる単語を取得する。ラベル付与部３２は、取得した単語を１つずつ選択し、選択した単語を示すラベルデータを、選択した単語の開始時刻または終了時刻から所定だけ前後の時間の映像データに付与する。例えば、選択した単語ｗの開始時刻がｔである場合、映像データの時刻ｔ−ａから時刻ｔ＋ｂに単語ｗを表すラベルデータが付与される（ａ，ｂは０以上）。解説された内容は、解説の後に映像に表われることが多いため、ａ＜ｂとしてもよい。 In the above embodiment, the scene dividing unit 31 and the scene dividing unit 41 divide the video data into scenes, but they may instead divide the video data into shots or into segments of a predetermined duration and output the result as divided video data.
The video classifier learning device 1 can also be configured without the scene dividing unit 31. In this case, the label assigning unit 32 obtains the words to serve as labels by excluding words of predetermined parts of speech, such as particles and auxiliary verbs, from the recognized words indicated by the speech recognition result data. The label assigning unit 32 selects the obtained words one by one and assigns label data indicating the selected word to the video data in a time range extending a predetermined amount before and after the start time or end time of the selected word. For example, when the start time of the selected word w is t, label data representing the word w is assigned to the video data from time t-a to time t+b (a and b are 0 or more). Because the content described in the commentary often appears in the video after the commentary, a < b may be used.
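The time window used when no scene division is available can be written down directly. In this minimal sketch the function name label_window and the clamping of the window to the bounds of the video are assumptions; before and after correspond to a and b in the text.

```python
def label_window(word_start, before, after, video_length):
    """Return the (start, end) of the video range labeled with a word uttered at word_start.

    The window runs from t - a to t + b; clamping it to [0, video_length]
    is an assumption added for this sketch.
    """
    if before < 0 or after < 0:
        raise ValueError("before and after must be 0 or more")
    return max(0.0, word_start - before), min(video_length, word_start + after)

# a < b: the described content tends to appear on screen after it is mentioned.
window = label_window(226.5, before=1.0, after=4.0, video_length=3600.0)
```

Choosing after larger than before biases the labeled range towards the video that follows the utterance.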

なお、上述した実施形態では、認識対象のコンテンツが動画である場合を説明したが、静止画であってもよい。認識対象のコンテンツが静止画である場合、静止画の画像データは、シーン分割部４１に入力されず、認識部４２に直接入力される。また、識別器学習部３３は、静止画の画像データから抽出可能な画像特徴量を用いて識別器を学習する。 In the above-described embodiment, the case where the content to be recognized is a moving image has been described, but it may be a still image. When the content to be recognized is a still image, the image data of the still image is input directly to the recognition unit 42 without being input to the scene division unit 41. The classifier learning unit 33 learns a classifier using image feature amounts that can be extracted from image data of a still image.

上述した実施形態によれば、映像識別器学習装置１は、解説放送番組に人手をかけることなく識別器学習用のラベルを付与することができる。また、映像識別器学習装置１は、クローズドキャプションを利用した従来の手法に比較して精度よくラベルを付与することが可能である。従って、映像識別器学習装置１は、従来よりも精度のよい識別器を学習することができる。 According to the embodiment described above, the video discriminator learning device 1 can assign labels for discriminator learning to commentary broadcast programs without manual effort. The video discriminator learning device 1 can also assign labels more accurately than the conventional method using closed captions. The video discriminator learning device 1 can therefore learn a more accurate discriminator than before.

なお、上述の映像識別器学習装置１は、内部にコンピュータシステムを有している。そして、映像識別器学習装置１の動作の過程は、プログラムの形式でコンピュータ読み取り可能な記録媒体に記憶されており、このプログラムをコンピュータシステムが読み出して実行することによって、上記処理が行われる。ここでいうコンピュータシステムとは、ＣＰＵ及び各種メモリやＯＳ、周辺機器等のハードウェアを含むものである。 Note that the video classifier learning device 1 described above has a computer system therein. The operation process of the image discriminator learning device 1 is stored in a computer-readable recording medium in the form of a program, and the above processing is performed by the computer system reading and executing this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

また、「コンピュータシステム」は、ＷＷＷシステムを利用している場合であれば、ホームページ提供環境（あるいは表示環境）も含むものとする。
また、「コンピュータ読み取り可能な記録媒体」とは、フレキシブルディスク、光磁気ディスク、ＲＯＭ、ＣＤ−ＲＯＭ等の可搬媒体、コンピュータシステムに内蔵されるハードディスク等の記憶装置のことをいう。さらに「コンピュータ読み取り可能な記録媒体」とは、インターネット等のネットワークや電話回線等の通信回線を介してプログラムを送信する場合の通信線のように、短時間の間、動的にプログラムを保持するもの、その場合のサーバやクライアントとなるコンピュータシステム内部の揮発性メモリのように、一定時間プログラムを保持しているものも含むものとする。また上記プログラムは、前述した機能の一部を実現するためのものであってもよく、さらに前述した機能をコンピュータシステムにすでに記録されているプログラムとの組み合わせで実現できるものであってもよい。 Further, the “computer system” includes a homepage providing environment (or display environment) if a WWW system is used.
The “computer-readable recording medium” refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The “computer-readable recording medium” also includes a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as the volatile memory inside a computer system that acts as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

１映像識別器学習装置
２解説音声認識結果抽出部
３識別器学習処理部
４認識処理部
２１解説音声抽出部
２２音響モデル記憶部
２３言語モデル記憶部
２４音声認識部
２５モデル適応化部
３１シーン分割部
３２ラベル付与部
３３識別器学習部
３４識別器
４１シーン分割部
４２認識部
DESCRIPTION OF SYMBOLS
1 Video discriminator learning device
2 Commentary voice recognition result extraction unit
3 Classifier learning processing unit
4 Recognition processing unit
21 Commentary voice extraction unit
22 Acoustic model storage unit
23 Language model storage unit
24 Voice recognition unit
25 Model adaptation unit
31 Scene division unit
32 Label assigning unit
33 Classifier learning unit
34 Classifier
41 Scene division unit
42 Recognition unit

Claims

A video discriminator learning device comprising:
a commentary voice extraction unit that compares a first audio signal, which is the audio signal of program audio, with a second audio signal, which is the audio signal of the program audio with commentary audio added, and extracts a commentary audio signal, which is the audio signal of the commentary audio;
a voice recognition unit that performs voice recognition on the commentary audio signal extracted by the commentary voice extraction unit; and
a discriminator learning processing unit that extracts a word to serve as a label from the result of the voice recognition by the voice recognition unit and, using the extracted word and a feature amount extracted from program video data of the video section corresponding to the utterance time of the word, learns a discriminator that detects, from the feature amount of video data, whether or not the video data relates to the word.

The video discriminator learning device according to claim 1, wherein the discriminator learning processing unit comprises:
a scene division unit that outputs divided video data obtained by dividing the program video data into scenes;
a label assigning unit that extracts a word to serve as a label from the result of the voice recognition by the voice recognition unit, and assigns the word as a label to the divided video data at the time corresponding to the utterance time of the extracted word; and
a discriminator learning unit that learns the discriminator using the divided video data to which the label assigning unit has assigned the label.

The video discriminator learning device according to claim 1 or 2, further comprising an adaptation unit that performs at least one of: a process of adapting an acoustic model used by the voice recognition unit for voice recognition, using the commentary audio signal extracted by the commentary voice extraction unit and the result of the voice recognition by the voice recognition unit; and a process of adapting a language model used by the voice recognition unit for voice recognition, using one or more of the result of the voice recognition by the voice recognition unit, closed captions of the program, and program information.

The video discriminator learning device according to any one of claims 1 to 3, further comprising a recognition unit that recognizes video data using the discriminator learned by the discriminator learning processing unit.

The video discriminator learning device according to any one of claims 1 to 4, wherein the first audio signal and the second audio signal are the main audio signal and the sub audio signal of a broadcast program.

A program for causing a computer to function as a video discriminator learning device comprising:
commentary voice extraction means for comparing a first audio signal, which is the audio signal of program audio, with a second audio signal, which is the audio signal of the program audio with commentary audio added, and extracting a commentary audio signal, which is the audio signal of the commentary audio;
voice recognition means for performing voice recognition on the commentary audio signal extracted by the commentary voice extraction means; and
discriminator learning processing means for extracting a word to serve as a label from the result of the voice recognition by the voice recognition means and, using the extracted word and a feature amount extracted from program video data of the video section corresponding to the utterance time of the word, learning a discriminator that detects, from the feature amount of video data, whether or not the video data relates to the word.