JP2004159192A

JP2004159192A - Video summarizing method and program, and storage medium storing video summarizing program

Info

Publication number: JP2004159192A
Application number: JP2002324322A
Authority: JP
Inventors: Makoto Muto; 誠武藤; Yukinobu Taniguchi; 行信谷口; Tadashi Nakanishi; 正仲西
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-11-07
Filing date: 2002-11-07
Publication date: 2004-06-03

Abstract

【課題】映像視聴中の視聴行動を分析して視聴者が興味を持った映像区間を抽出し、要約映像のＩＮ／ＯＵＴ点が中途半端な時点に設定されることが少なく、任意の時間長の要約映像を作成でき、要約映像のみから元の映像の内容を把握したり、要約映像の各シーンの選択意図を理解したりすることを可能にする。
【解決手段】本発明は、映像のカット点を含む変化点による映像インデクスの抽出を行い、映像視聴中に行った視聴行動による、視聴行動インデクスを抽出し、映像の時間長の各時点におけるユーザの興味の度合を表す興味強度関数を設定し、抽出された区間から興味強度が所定の閾値以上の区間を抽出し、抽出された区間を連続再生する。
【選択図】図１

[Problem] To analyze viewing behavior during video viewing and extract the video sections in which the viewer showed interest, such that the IN/OUT points of the summary video are rarely set at awkward mid-scene positions, a summary video of arbitrary length can be created, and both the content of the original video and the reason each scene was selected can be understood from the summary video alone.
[Solution] The invention extracts a video index from change points of the video, including cut points; extracts a viewing behavior index from the viewing behavior performed during video viewing; sets an interest intensity function representing the user's degree of interest at each point in time of the video; extracts, from the extracted sections, the sections whose interest intensity is at or above a predetermined threshold; and plays the extracted sections back continuously.
[Selected drawing] Fig. 1

Description

【０００１】
【発明の属する技術分野】
本発明は、映像要約方法及びプログラム及び映像要約プログラムを格納した記憶媒体に係り、特に、映像から要約を作成するための映像要約方法及びプログラム及び映像要約プログラムを格納した記憶媒体に関する。
【０００２】
【従来の技術】
今日、一般利用者向けのパーソナルコンピュータや、ハードディスクビデオレコーダにノンリニア編集機能が付き、個人で撮影した映像や番組映像の要約映像を個人で作成することが可能となってきている。
【０００３】
しかしながら、一般的な映像編集ソフトは操作が繁雑なため、一般利用者には扱い辛い。そこで、撮影映像中、または、番組映像視聴中に、お気に入りシーンの時点でボタン押下げ等の操作を行い、その操作履歴に基づいて、自動で要約映像を生成する技術が提案されている。
【０００４】
従来の第１の方法として、映像視聴中にユーザがお気に入りのシーンで押しボタンを押し、その時点から予め設定された任意の値にオフセット値を加算・減算してＩＮ点、ＯＵＴ点等とすることで、お気に入りシーンの編集を行う方法がある（例えば、特許文献１参照）。
【０００５】
また、従来の第２の方法として、映像の各シーンの物理的特徴から、当該シーンの重要度を計算し、重要度が一定の閾値を越えるシーンをつなぎ合わせて要約映像を得る方法がある（例えば、特許文献２参照）。
【０００６】
【特許文献１】
特開２００１−５７６６０
【特許文献２】
特開２００１−１１９６４９
【０００７】
【発明が解決しようとする課題】
しかしながら、上記従来の第１の方法には、ＩＮ点、ＯＵＴ点が映像のカット構成等の面で中途半端な時点に設定される場合があるという問題がある。例えば、ダンスの一連の動きを映したひと続きのシーンの途中がＩＮ点等に設定される場合があり、その場合、一連の動きが要約映像ではわからない。
【０００８】
また、上記の技術は、ユーザによる押しボタン操作回数と、ＩＮ点とＯＵＴ点設定用のオフセット値とによって映像の長さが決まるので、要約映像を作成した後に任意の時間長の要約映像に作り直すことはできないという問題がある。
【０００９】
さらに、映像を見ている間に興味を持ったシーンで押しボタンを押す操作は煩わしく、映像に夢中になって押しボタンを押すのを忘れる場合があるという問題がある。
【００１０】
また、上記従来の第２の方法には、要約映像を他人が視聴した場合に、映像が断片化されているために、内容を把握しづらいという問題がある。例えば、ドラマ番組の場合、単一シーンを見ても当該シーン以前のストーリ展開が分からないので、内容が把握できない。また、各シーンが選択された意図がわからない。例えば、お気に入りの車が映っているシーンを含む要約映像を、お勧め映像として友人に送って当該友人が視聴した場合、当該シーンにおいて車がお勧めなのか、それ以外で映っているものがお勧めなのかがわからない。
【００１１】
本発明は、上記の点に鑑みなされたもので、視聴者が映像を見ているときの自然な視聴行動（発声、ジェスチャ等）を計算機が分析して視聴者が興味を持った映像区間を抽出し、その際に、（１）要約映像のＩＮ点、ＯＵＴ点が、映像のカット構成等の面で中途半端な時点に設定されることが少なく、（２）また、任意の時間長の要約映像を作成でき、（３）また、要約映像のみから元の映像の内容を把握したり、要約映像の各シーンの選択意図を理解したりすることが可能な映像要約方法及びプログラム及び映像要約プログラムを格納した記憶媒体を提供することを目的とする。
【００１２】
【課題を解決するための手段】
図１は、本発明の原理を説明するための図である。
【００１３】
本発明は、映像から要約を作成する映像要約方法において、
入力された映像のカット点を含む変化点による映像インデクスの抽出を行う変化点抽出処理（ステップ１）と、
ユーザが映像視聴中に行った視聴行動による、発声区間抽出、身体動作区間抽出、演奏区間抽出のいずれか、または全部の視聴行動インデクスの抽出を行う視聴行動抽出処理（ステップ２）と、
映像の時間長の各時点におけるユーザの興味の度合を表す興味強度関数を設定する興味強度関数設定処理（ステップ３）と、
視聴行動抽出処理により抽出された区間から興味強度が所定の閾値以上の区間を抽出する興味強度区間抽出処理（ステップ４）と、
興味強度区間抽出処理で抽出された区間を連続再生する再生処理（ステップ５）と、からなる。
【００１４】
また、本発明は、視聴行動抽出処理（ステップ２）において、
視聴行動の強度／長さ、もしくはその両方を計測する視聴行動インデクシング処理を含み、
興味強度関数設定処理（ステップ３）において、
視聴行動の強度／長さに応じて、興味強度関数を連続値として設定し、映像インデクスの変化点で該興味強度を変化させる処理を含む。
【００１５】
また、本発明は、興味強度が所定の閾値以上の区間を抽出して、該区間に視聴行動を合成する処理を含む。
【００１６】
また、本発明は、変化点抽出処理（ステップ１）と、視聴行動抽出処理（ステップ２）で抽出された映像インデクスと視聴行動インデクスとで区間が一致するインデクスを抽出する処理と、
一致するインデクスの開始点及び終了点における興味強度の変化量を増加させる処理を含む。
【００１７】
本発明は、変化点抽出処理（ステップ１）において抽出された映像インデクスの中で、区間と特徴量が一致するインデクスを抽出する処理と、
興味強度が所定の閾値以上の区間を抽出して、該区間に区間と特徴量が一致するインデクスに対応するすべての視聴行動を合成する処理を含む。
【００１８】
本発明は、映像から要約を作成する映像要約プログラムであって、
入力された映像のカット点を含む変化点による映像インデクスの抽出を行う変化点抽出ステップと、
ユーザが映像視聴中に行った視聴行動による、強度／長さまたは、その両方を計測することにより、発声区間抽出、身体動作区間抽出、演奏区間抽出のいずれか、または全部の視聴行動インデクスの抽出を行う視聴行動抽出ステップと、
視聴行動の強度／長さに応じて、興味強度関数を連続値として設定し、映像インデクスの変化点で該興味強度を変化させる興味強度関数設定ステップと、
変化点抽出ステップと、視聴行動抽出ステップで抽出された映像インデクスと視聴行動インデクスとで区間が一致するインデクスを抽出する区間一致インデクス抽出ステップと、
インデクスの開始点及び終了点における興味強度の変化量を増加させる変化量増加ステップと、
変化点抽出ステップにおいて抽出された映像インデクスの中で、区間と特徴量が一致するインデクスを抽出する特徴量一致インデクス抽出ステップと、
興味強度が所定の閾値以上の区間を抽出して、該区間に区間と特徴量が一致するインデクスに対応するすべての視聴行動を合成するステップと、
抽出された映像区間に該抽出された映像区間の視聴行動を合成して、連続再生する再生ステップと、を実行する。
【００１９】
本発明は、映像から要約を作成する映像要約プログラムを格納した記憶媒体であって、
入力された映像のカット点を含む変化点による映像インデクスの抽出を行う変化点抽出ステップと、
ユーザが映像視聴中に行った視聴行動による、強度／長さまたは、その両方を計測することにより、発声区間抽出、身体動作区間抽出、演奏区間抽出のいずれか、または全部の視聴行動インデクスの抽出を行う視聴行動抽出ステップと、
視聴行動の強度／長さに応じて、興味強度関数を連続値として設定し、映像インデクスの変化点で該興味強度を変化させる興味強度関数設定ステップと、
変化点抽出ステップと、視聴行動抽出ステップで抽出された映像インデクスと視聴行動インデクスとで区間が一致するインデクスを抽出する区間一致インデクス抽出ステップと、
インデクスの開始点及び終了点における興味強度の変化量を増加させる変化量増加ステップと、
変化点抽出ステップにおいて抽出された映像インデクスの中で、区間と特徴量が一致するインデクスを抽出する特徴量一致インデクス抽出ステップと、
興味強度が所定の閾値以上の区間を抽出して、該区間に区間と特徴量が一致するインデクスに対応するすべての視聴行動を合成するステップと、
抽出された映像区間に該抽出された映像区間の視聴行動を合成して、連続再生する再生ステップと、からなるプログラムを格納する。
【００２０】
上記のように、本発明では、映像のカット点等の抽出を行い、発声等の視聴行動のインデクシング処理を行い、映像のインデクスの変化点で興味強度を変化させる関数を設定し、興味強度が閾値以上の区間を抽出し、抽出区間の連続再生を行うことにより、要約映像のＩＮ点、ＯＵＴ点が、映像のカット構成等の面で中途半端な時点に設定されることが少なくなる。
【００２１】
また、本発明では、映像のカット点等の抽出を行い、発声等の視聴行動の強度／長さ、もしくは、その両方を計測し、視聴行動の強度／長さに応じて、興味強度関数を連続値として設定、映像インデクスの変化点で興味強度を変化させる興味強度関数を設定し、興味強度が閾値以上の区間を抽出し、抽出区間の連続再生を行うことにより、任意の時間長の要約映像を作成することが可能となる。
【００２２】
また、本発明では、映像のカット点等の抽出を行い、発声等の視聴行動のインデクシング処理を行い、映像インデクスの変化点で興味強度を変化させる興味強度関数を設定し、興味強度が閾値以上の区間を抽出し、抽出区間に視聴行動を合成し、抽出区間の連続再生を行うことにより、要約映像のみから元の映像の内容を把握したり、要約映像の各シーンの選択意図を理解したりすることが可能となる。
【００２３】
【発明の実施の形態】
以下、図面と共に本発明の実施の形態を説明する。
【００２４】
図２は、本発明の一実施の形態におけるシステム構成を示す。同図に示すシステムは、モニタ５０１、マイク５０２、カメラ５０３、鍵盤５０４、押しボタン５０５及び計算機５０６から構成される。
【００２５】
モニタ５０１には、番組映像などが表示され、ユーザが視聴する。
【００２６】
マイク５０２は、ユーザが映像視聴中に行う発声を記録する。例えば、スポーツ番組のファインプレーのシーンで「すごい」という発声を記録する。
【００２７】
カメラ５０３は、ユーザのジェスチャ等の身体動作を記録する。例えば、スポーツ番組でユーザのお気に入りの選手の勝利の瞬間のシーンでユーザがガッツポーズをする様子を記録する。
【００２８】
鍵盤５０４は、ユーザの演奏を記録する。例えば、ドキュメンタリ番組で、出演者が何等かの成功をしたシーンで、ユーザが行うお祝いの演奏を記録する。
【００２９】
押しボタン５０５は、ユーザの押しボタン操作を記録する。ここで、当該押しボタン５０５は、ＯＮ，ＯＦＦの２値情報の他に、ボタンの押し下げ圧力と、押し下げ継続時間を記録することができるものである。例えば、料理番組で、材料のフリップが提示されるシーンでユーザが行う押しボタン操作を記録する。
【００３０】
計算機５０６は、モニタ５０１への映像の送出と、マイク５０２、カメラ５０３、鍵盤５０４からの信号入力と、各種処理を行う。
【００３１】
このシステムでは、ユーザの各種視聴行動（発声、身体動作等）をマイク等で取得し、当該視聴行動が発生した映像区間を重要なシーンとして要約映像を生成する。
【００３２】
視聴行動としては、以下のようなものが考えられる。
【００３３】
・発声：
例：「凄い」、「うまそうだ」、「それだね」、歌唱
・身体動作：
例：ガッツポーズ、ウィンク、跳躍
・楽器の演奏：
例：ギターの即興演奏、「君が代」のピアノ演奏
・押しボタンの押し下げ
これらの視聴行動の中で、「発声」、「身体動作」、「楽器の演奏」は、ユーザが映像視聴中に自然に行う行動であり、「押しボタンの押し下げ」と比べ、ユーザへの精神的負担が少ないことから、要約映像生成のために取得する視聴行動として望ましい。
【００３４】
［第１の実施の形態］
図３は、本発明の第１の実施の形態における動作のフローチャートである。
【００３５】
ステップ１０１）入力映像のカット点抽出、音声区間抽出、音楽区間抽出等のインデクシング処理を行う。このための処理としては、例えば、［谷口行信、外村佳伸、浜田洋：映像ショット切り換え検出方法とその映像アクセスインタフェースへの応用，電子情報通信学会論文誌、Ｖｏｌ．Ｊ７９−Ｄ２，Ｎｏ．４，ｐｐ．５３８−５４６］、［南憲一、阿久津明人、浜田洋、外村佳伸：音情報を用いた映像インデクシングとその応用，電子情報通信学会論文誌、Ｖｏｌ．Ｊ８１−Ｄ２，Ｎｏ．３，ｐｐ．５２９−５３７］等を用いることができる。ここでは、映像処理により検出可能なインデクスを例として挙げたが、字幕放送の情報等に基づいてニュース項目の切れ目やドラマのシーンを検出してインデクスとすることもできる。
【００３６】
ステップ１０２）ユーザが映像視聴中に行った発声・身体動作・演奏等に対する発声区間抽出、身体動作区間抽出、演奏区間抽出等のインデクシング処理を行う。詳細については後述する。
【００３７】
また、上記のステップ１０１とステップ１０２の順序は逆であってもよい。また、ステップ１０１とステップ１０２は同時に行ってもよい。
【００３８】
ステップ１０３）図４のような興味強度関数を設定する。興味強度関数は、映像の時間長の各時点におけるユーザの興味の度合を表すものであり、０から１の値をとり、値が大きいほど興味が強いことを示す。興味強度関数の推定方法については後述する。
【００３９】
ステップ１０４）図５のような興味強度が一定の閾値以上の区間（シーン１〜３）を抽出する。閾値はユーザが任意に設定する。また、ユーザが要約映像時間を任意に設定し、要約映像時間に応じて閾値を決定してもよい。
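The threshold extraction of step 104, including the option of deriving the threshold from a desired summary length, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the interest intensity function is assumed to be available as a list of piecewise-constant segments, and the names `Segment`, `extract_scenes` and `threshold_for_duration` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed representation: (start_sec, end_sec, interest level in [0, 1]).
Segment = Tuple[float, float, float]


def extract_scenes(segments: List[Segment], threshold: float) -> List[Tuple[float, float]]:
    """Return merged (start, end) intervals whose interest level is >= threshold."""
    scenes: List[Tuple[float, float]] = []
    for start, end, level in sorted(segments):
        if level < threshold:
            continue
        if scenes and abs(scenes[-1][1] - start) < 1e-9:
            # Adjacent to the previous scene: extend it instead of starting a new one.
            scenes[-1] = (scenes[-1][0], end)
        else:
            scenes.append((start, end))
    return scenes


def threshold_for_duration(segments: List[Segment], max_seconds: float) -> Optional[float]:
    """Lowest threshold whose summary fits within max_seconds, or None if none fits."""
    for threshold in sorted({level for _, _, level in segments}):
        total = sum(end - start for start, end in extract_scenes(segments, threshold))
        if total <= max_seconds:
            return threshold
    return None
```

Only the interest levels that actually occur need to be tried as thresholds, because the summary length of a step-shaped function can change only at those values.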
【００４０】
ステップ１０５）ステップ１０４において抽出した映像区間の映像に、視聴行動を合成する。詳細については後述する。
【００４１】
ステップ１０６）ステップ１０４において抽出した区間を連続再生する。これが要約映像である。また、抽出区間の映像をつなぎ合わせて一つの映像ファイルとして要約映像を作成してもよい。また、抽出区間の時間情報のリストとして要約映像を実現してもよい。
【００４２】
次に、上記のステップ１０２における視聴行動インデクシング処理について詳細に説明する。
【００４３】
図６は、本発明の第１の実施の形態における視聴行動インデクシング処理のフローチャートである。
【００４４】
ステップ２０１）視聴行動の種別（発声・身体動作・演奏・ボタン押し下げ等）を判定する。これは、マイク５０２、カメラ５０３、鍵盤５０４、押しボタン５０５等から得た信号を、計算機５０６において別々のファイルとして保存しておき、解析対象がどのファイルかを判定することで、視聴行動の種別を判定する。また、マイク５０２、カメラ５０３、鍵盤５０４、押しボタン５０５等から得た信号を同一のファイル中の別々のトラックに保存しておき、解析対象がどのトラックのものかを判定することで、視聴行動の種別を判定してもよい。
【００４５】
ステップ２０２）ステップ２０１で判定した種別が発声かどうかを判定する。発声である場合は、ステップ２０３に移行する。発声でない場合には、ステップ２０５に移行する。
【００４６】
ステップ２０３）発声の音声信号の強度計算を行う。発声の音声信号の強度は、音声信号のパワー値とする。また、音声信号のパワーや音声の発話速度、声の大きさ等に基づいて強度を計算してもよい。
【００４７】
ステップ２０４）ステップ２０３で得られた音声強度が一定の閾値以上の時間区間を抽出し（図７）、発声区間（Ｌ）とする。この方法は、［南憲一，阿久津明人，浜田洋、外村佳伸：音情報を用いた映像インデクシングとその応用，電子情報通信学会論文誌，Ｖｏｌ．Ｊ８１−Ｄ２，Ｎｏ．３，ｐｐ．５２９−５３７］等に開示されている方法を用いることができる。ここで、発声区間の時間長が一定時間以上の場合のみ発声区間と判断するようにしてもよい。
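Steps 203 and 204 (signal power followed by thresholding, with an optional minimum duration) can be sketched as below. This is a simplified stand-in, not the method of the cited paper: the frame length, the use of mean squared amplitude as "power", and the function names are all assumptions made for illustration. The same thresholding helper would apply to the body-movement and performance intensities of steps 207 and 210.

```python
from typing import List, Sequence, Tuple


def frame_power(samples: Sequence[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each non-overlapping frame (assumed power measure)."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_intervals(intensity: Sequence[float], threshold: float,
                     frame_sec: float, min_sec: float = 0.0) -> List[Tuple[float, float]]:
    """(start_sec, end_sec) runs where intensity >= threshold for at least min_sec."""
    intervals: List[Tuple[float, float]] = []
    run_start = None
    # A sentinel below any threshold closes a run that reaches the end of the signal.
    for i, value in enumerate(list(intensity) + [float("-inf")]):
        if value >= threshold and run_start is None:
            run_start = i
        elif value < threshold and run_start is not None:
            start, end = run_start * frame_sec, i * frame_sec
            if end - start >= min_sec:  # optional minimum-duration rule
                intervals.append((start, end))
            run_start = None
    return intervals
```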
【００４８】
ステップ２０５）ステップ２０１で判定した種別が身体動作かどうかを判定する。身体動作である場合には、ステップ２０６に移行する。身体動作でない場合には、ステップ２０８に移行する。
【００４９】
ステップ２０６）身体動作の強度計算を行う。カメラ５０３で撮影した映像のフレーム間の差分値を身体動作の強度とする。フレーム間の差分値の計算は、カット点検出等で用いられる手法［谷口行信、外村佳伸、浜田洋：映像ショット切り換え検出方法とその映像アクセスインタフェースへの応用，電子情報通信学会論文誌、Ｖｏｌ．Ｊ７９−Ｄ２，Ｎｏ．４，ｐｐ．５３８−５４６］を用いるものとする。
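The idea in step 206 of using an inter-frame difference as the body-movement intensity can be illustrated with a very small sketch. The mean absolute pixel difference used here is an assumed, simplified measure, not the cut-detection technique of the cited paper. Frames are assumed to be equally sized grayscale images given as nested lists.

```python
from typing import List, Sequence

Frame = Sequence[Sequence[float]]  # assumed: rows of grayscale pixel values


def frame_difference(prev: Frame, curr: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(prev, curr):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def motion_intensity(frames: Sequence[Frame]) -> List[float]:
    """One intensity value per consecutive pair of camera frames."""
    return [frame_difference(a, b) for a, b in zip(frames, frames[1:])]
```

The resulting sequence can then be thresholded over time in the same way as the audio power to obtain body-movement sections.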
【００５０】
ステップ２０７）ステップ２０６で得られた身体動作強度が一定の閾値以上の時間区間を身体動作区間とする。ここで、身体動作区間の時間長が一定時間以上の場合のみ身体動作区間と判断するようにしてもよい。
【００５１】
ステップ２０８）ステップ２０１で判定した種別が演奏かどうかを判定する。演奏である場合には、ステップ２０９に移行し、演奏でない場合にはステップ２１１に移行する。
【００５２】
ステップ２０９）演奏の強度計算を行う。演奏の音響信号のパワーを計算し、演奏強度とする。また、演奏のテンポ、打鍵速度等に基づいて強度計算を行ってもよい。
【００５３】
ステップ２１０）ステップ２０９で得られた演奏強度が一定の閾値以上の時間区間を抽出し、演奏区間とする。ここで、演奏区間の時間長が一定時間以上の場合のみ演奏区間と判断するようにしてもよい。
【００５４】
ステップ２１１）ステップ２０１で判定した種別がボタン押し下げかどうか判定する。ボタン押し下げである場合は、ステップ２１２に移行する。ボタン押し下げでない場合には、ステップ２１６に移行する。
【００５５】
ステップ２１２）ボタン押し下げの強度計算を行う。ボタン押し下げの圧力を取得し、押し下げの強度とする。また、押し下げの速度検出や連射検出に基づいて強度計算を行ってもよい。例えば、ボタンの連射をした場合、通常の押し下げよりも高い強度値とすることが考えられる。
【００５６】
ステップ２１３）ステップ２１２で得られたボタン押し下げ強度が一定の閾値以上の時間区間を抽出し、ボタン押し下げ区間とする。
【００５７】
ステップ２１４）ステップ２０４または、ステップ２０７または、ステップ２１０または、ステップ２１３で抽出された区間を、後の処理のために視聴行動区間のリストとして保存する。視聴行動区間は、例えば、区間の開始時刻・終了時刻、視聴行動種別の組として表現できる。
【００５８】
ステップ２１５）すべての視聴行動について処理が終了したかどうかの判定を行う。終了した場合は当該処理を終了する。終了していない場合は、ステップ２０１に移行し、別の視聴行動について処理を行う。
【００５９】
ステップ２１６）ユーザにエラーの通知を行い、処理を終了する。また、発声・身体動作・演奏・ボタン押し下げ以外の視聴行動を扱う場合は、エラー通知を行わずに、当該行動に関して発声等と同様の処理を行ってもよい。
【００６０】
次に、上記のステップ１０３の興味強度関数設定処理について詳細に説明する。
【００６１】
図８は、本発明の第１の実施の形態における興味強度関数設定処理のフローチャートである。
【００６２】
以下では、すべての視聴行動区間について、ステップ３０１からステップ３０９の処理を繰り返す。
【００６３】
ステップ３０１）視聴行動を一つ選び、その行動種別（発声・身体動作・演奏・押しボタン等）を判定する。これは、マイク５０２、カメラ５０３、鍵盤５０４、押しボタン５０５等から得た信号を、計算機５０６において別々のファイルとして保存しておき、解析対象がどのファイルかを判定することで、視聴行動の種別を判定する。また、マイク５０２、カメラ５０３、鍵盤５０４、押しボタン５０５等から得た信号を同一のファイル中の別々のトラックに保存しておき、解析対象がどのトラックのものかを判定することで、視聴行動の種別を判定してもよい。
【００６４】
ステップ３０２）興味強度の極大値を表すＨ値を次式によって計算する。
【００６５】
【数１】

Ｈ値は、当該視聴行動の記録からユーザの興味の度合を推定した値であり、０から１の値を取り、値が大きいほど興味が強いことを示す。
【００６６】
ここで、ｃｋ値は、表１のように各視聴行動種別に応じて定められており、当該視聴行動種別と興味の度合との関連の強さを相対的に示したものである。
【００６７】
【表１】

例えば、表１のような値の場合、興味の度合と発声との関連は、ジェスチャとの関連よりも強いことを示す。
【００６８】
Ｌ値は、視聴行動区間の時間長である（図７）。視聴行動の時間が長いほど、Ｈ値が大きくなり、興味強度が大きくなる。例えば、ユーザがジェスチャを長時間行った場合は、興味強度が大きい。
【００６９】
Ｐ値は、音声信号の強度や、身体動作強度等の各種視聴行動の強度の時間平均値である（図７）。各種視聴行動の強度が大きいほど、Ｈ値が大きくなり、興味強度が大きい。例えば、ユーザが大きい声で発声した場合は、興味強度が大きい。Ｃ値は適当な定数である。Ｈ値の計算にシグモイド関数を用いたのは、値の範囲を０から１の範囲にするためである。また、０から１の値を返す任意の関数を用いてもよい。
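The formula for the H value is not reproduced in this text, so the following is only one possible form that is consistent with the description: a sigmoid whose argument grows with the type weight c_k, the section length L, and the mean intensity P, shifted by the constant C. The exact way these four quantities are combined is an assumption of this sketch, as is the function name `h_value`.

```python
import math


def h_value(c_k: float, length_sec: float, mean_intensity: float, c: float) -> float:
    """Peak interest H in (0, 1).

    ASSUMED form: sigmoid(c_k * L * P - C). The source only states that H
    increases with L and P, uses a sigmoid, and involves a constant C.
    """
    x = c_k * length_sec * mean_intensity - c
    return 1.0 / (1.0 + math.exp(-x))
```

Under this assumed form, a longer or stronger viewing behavior raises H, and a larger c_k makes a given behavior type count for more, which matches the qualitative behavior described above.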
【００７０】
以下のステップ３０３からステップ３０９までの処理を図９を用いて説明する。
【００７１】
ステップ３０３）視聴行動区間の開始点からｔ秒間だけ時間的に遡った時点から視聴行動区間の終了点までの区間の興味強度をＨと設定する。ここでは、ｔ値は表２のように、各視聴行動種別に応じて定められている。
【００７２】
【表２】

ｔ値は、映像中の興味深いシーンの出現から、ユーザが視聴行動をとるまでの時間的な遅れを表す。例えば、表２の場合は、興味深いシーンが出現してから、０．５秒後に、ユーザが発声することが多いことを示す。
【００７３】
ステップ３０４）視聴行動区間から時間的に遡り、映像のインデクス点で興味強度をｄ値だけ減算する。ここで、映像のインデクス点とは、映像のカット点、音声区間の開始点・終了点、音楽区間の開始点・終了点等を表す。ｄ値は、表３のように各映像インデクス、各視聴行動種別に応じて定められている。
【００７４】
【表３】

これは、映像のインデクス点では映像の内容面で変化が生じる場合が多く、そのために興味の度合も同様に変化していると考えられるため、映像のインデクス点において興味強度を減算する。ｄ値の大きさは、各インデクス点の種別と興味強度の変化との間の関連の強さを示す。ｄ値が大きいほど、当該インデクス点において興味強度の変化が大きい。ここで、図９のように複数の視聴行動に基づいて設定された興味強度関数が時間的に重なった区間は、興味強度が大きい方の値を当該区間の興味強度とする。また、興味強度を平均して当該区間の興味強度としてもよい。
【００７５】
ステップ３０５）興味強度が０以上かを判定する。０以上の場合は、ステップ３０６に移行する。０以上でない場合は、ステップ３０８に移行する。
【００７６】
ステップ３０６）さらに、時間的に遡り、ステップ３０４と同様に映像のインデクス点で興味強度をｄ値だけ減算する。
【００７７】
ステップ３０７）興味強度が０以上かを判定する。０以上の場合は、ステップ３０６に移行し、０以上でない場合には、ステップ３０８に移行する。
【００７８】
ステップ３０８）０未満の興味強度を０とする。これは、興味強度を０から１の範囲の値にするために行う。
【００７９】
ステップ３０９）視聴行動の時間的に後の区間についても、同様にステップ３０４以降の処理を行う。
【００８０】
ステップ３１０）全ての視聴行動に対して興味強度関数の設定処理を終了したかどうかを判定する。終了した場合は当該処理を終了する。終了していない場合には、ステップ３０１に移行し、別の視聴行動に対して興味強度関数の設定処理を行う。
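Steps 303 to 309 build a step-shaped interest function whose level changes only at video index points. A minimal sketch for a single viewing behavior follows. It assumes one decrement `d` for every index point (the source varies d by index type and behavior type), and all names here are hypothetical. Overlapping functions from several behaviors are combined by taking the larger value, as the text describes.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float, float]  # (start_sec, end_sec, interest level)


def _walk(edge: float, stops: List[float], h: float, d: float) -> List[Segment]:
    """Lower the level by d at each index point while moving away from the plateau."""
    out: List[Segment] = []
    level = h
    for stop in stops:
        level -= d            # crossing the index point located at `edge`
        if level <= 0:        # would be clamped to 0, so nothing more to record
            break
        out.append((min(edge, stop), max(edge, stop), level))
        edge = stop
    return out


def step_interest(index_points: Iterable[float], start: float, end: float,
                  h: float, delay: float, d: float, duration: float) -> List[Segment]:
    """Step-shaped interest segments for one viewing behavior on [start, end]."""
    points = sorted(p for p in index_points if 0.0 < p < duration)
    onset = max(0.0, start - delay)                       # go back t seconds
    before = [p for p in reversed(points) if p <= onset]  # nearest index point first
    after = [p for p in points if p >= end]               # nearest index point first
    left = before[0] if before else 0.0
    right = after[0] if after else duration
    segments: List[Segment] = [(left, right, h)]          # level H up to the nearest index points
    if before:
        segments += _walk(left, before[1:] + [0.0], h, d)
    if after:
        segments += _walk(right, after[1:] + [duration], h, d)
    return sorted(segments)


def interest_at(t: float, segments: Iterable[Segment]) -> float:
    """Combined interest at time t: the maximum over overlapping segments, else 0."""
    return max((lv for s, e, lv in segments if s <= t < e), default=0.0)
```

Because every level change sits on an index point, any threshold applied afterwards yields IN and OUT points that coincide with cut points or other index points.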
【００８１】
次に、上記のステップ１０５における抽出区間に視聴行動を合成する処理について説明する。
【００８２】
図１０は、本発明の第１の実施の形態における抽出区間に視聴行動を合成する処理のフローチャートである。
【００８３】
ステップ４０１）ステップ１０４で抽出した映像区間のＩＮ点、ＯＵＴ点を取得する。
【００８４】
ステップ４０２）視聴行動の記録（発声・ジェスチャ等）に対して、ステップ４０１で取得したＩＮ点・ＯＵＴ点で切り出し処理を行う。ここで、映像と視聴行動のタイムコードは同期がとれているものとする。例えば、映像視聴中の発声を合成する場合、映像視聴中の発声のタイミングと同じタイミングでの発声区間が抽出される。
【００８５】
ステップ４０３）ステップ４０２で切り出した視聴行動を映像に合成する処理を行う。視聴行動の種別が「発声」の場合は、映像の音声信号に発声の音声信号を加算して合成する。また、音声の別トラックに、発声の音声信号を記録し、要約映像再生時に合成してもよい。また、発声の時間遅れを考慮して、表２の時間だけ時間的に早めてもよい。
【００８６】
視聴行動の種別が「身体動作」の場合は、映像の画面中に小画面を設け、表示する。また、映像と、身体動作の映像とを交互に切り換えて表示してもよい。また、身体動作の時間遅れを考慮して、表２の時間だけ時間的に早めて合成してもよい。
【００８７】
視聴行動の種別が「演奏」の場合は、映像の音声信号に演奏の音声信号を加算して合成する。また、音声の別トラックに、演奏の音響信号を記録し、要約映像再生時に合成してもよい。また、演奏の時間遅れを考慮して、表２の時間だけ時間的に早めて合成してもよい。
【００８８】
視聴行動の種別が「押しボタン」の場合は、ステップ４０４に移行する。また、映像にボタン印等を合成してもよい。また、押しボタン操作の時間遅れを考慮して、表２の時間だけ時間的に早めて合成してもよい。
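For the utterance and performance cases of step 403, the additive mixing with an optional time advance can be sketched as follows. Sample-level lists, the clipping to a fixed range, and the name `mix_tracks` are assumptions made for illustration. The source only states that the two signals are added and that the behavior may be shifted earlier by the delay of Table 2.

```python
from typing import List, Sequence


def mix_tracks(video_audio: Sequence[float], behavior_audio: Sequence[float],
               offset: int, limit: float = 1.0) -> List[float]:
    """Add behavior samples onto the video audio, starting at sample `offset`.

    A negative offset advances the behavior in time. Clipping to [-limit, limit]
    is an assumption added here to keep the sum in range.
    """
    mixed = list(video_audio)
    for i, sample in enumerate(behavior_audio):
        j = i + offset
        if 0 <= j < len(mixed):
            mixed[j] = max(-limit, min(limit, mixed[j] + sample))
    return mixed
```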
【００８９】
ステップ４０４）全ての視聴行動について処理が終了したかどうかの判定を行う。終了した場合は当該処理を終了する。終了していない場合にはステップ４０１に移行し、別の視聴行動について処理を行う。
【００９０】
上記のように、図３に示すステップ１０３とステップ１０４及び、図８に示す興味強度関数設定処理により、映像の切れ目となるカット点等で興味強度を不連続に変化させることで、興味強度が閾値以上の区間を抽出する処理において（ステップ１０４）、図５に示すように、要約映像のＩＮ点、ＯＵＴ点が映像の切れ目と一致する。
【００９１】
また、図６に示す興味強度計算及び図３のステップ１０４に示す処理により、視聴行動の強度（発声の大きさや継続時間に基づいて計算する）から、重要シーンの興味強度を０から１の連続値として計算する。その際に、視聴行動の強度が大きい場合に、興味強度が大きく、視聴行動の強度が小さい場合に、興味強度が小さくなるように計算する。そして、図５に示すように閾値以上の区間を要約映像のシーンとすることにより、要約映像の時間長を可変にできる。
【００９２】
また、図１０のステップ１０５により、映像から切り出された重要シーンにそのシーンにおいて時間的に同期して行われた視聴行動（発声等）を当該シーンに合成することが可能となる。
【００９３】
［第２の実施の形態］
本実施の形態は、前述の第１の実施の形態における興味強度関数設定を、インデクス点以外の区間で一定値とするのではなく、図１１のように視聴行動からの時間差に比例して値を減少させるものである。
【００９４】
前述のステップ１０３における興味強度設定処理を説明する。
【００９５】
図１２は、本発明の第２の実施の形態における興味強度設定処理のフローチャートである。同図において、前述の図８と同一動作については、同一ステップ番号を付与してその説明を省略する。
【００９６】
ステップ３０１〜ステップ３０２は、前述の第１の実施の形態と同様である。
ステップ６０３）興味強度関数の初期設定を行う。詳細については後述する。
【００９７】
ステップ３０４〜ステップ３１０は、前述の第１の実施の形態と同様である。
次に、上記のステップ６０３の興味強度関数設定処理について説明する。
【００９８】
図１３は、本発明の第２の実施の形態における興味強度関数設定処理のフローチャートである。
【００９９】
ステップ７０１）視聴行動区間の興味強度をＨと設定する。
【０１００】
ステップ７０２）視聴行動区間以外の区間について、視聴行動区間からの時間差に比例して、興味強度を減算する。興味強度の傾きは適当に設定する。また、視聴行動の種別に応じて当該傾きを別々に設定してもよい。また、視聴行動区間から時間差が大きくなるにつれて減少する任意の関数（例：ガウス関数）を用いてもよい。
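Steps 701 to 703 can be sketched as a function of time that equals H inside the viewing behavior section and falls off with the time gap outside it. The slope and sigma parameters and both function names are illustrative assumptions. The Gaussian variant corresponds to the alternative mentioned in step 702.

```python
import math


def decayed_interest(t: float, start: float, end: float, h: float, slope: float) -> float:
    """Linear fall-off: H inside [start, end], minus slope * gap outside, floored at 0."""
    if start <= t <= end:
        return h
    gap = start - t if t < start else t - end
    return max(0.0, h - slope * gap)


def gaussian_interest(t: float, start: float, end: float, h: float, sigma: float) -> float:
    """Alternative fall-off using a Gaussian of the time gap (assumed parameterization)."""
    gap = max(start - t, t - end, 0.0)
    return h * math.exp(-(gap * gap) / (2.0 * sigma * sigma))
```

Since the level now varies continuously between index points, raising or lowering the threshold changes the extracted length gradually rather than in jumps.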
【０１０１】
ステップ７０３）０未満の興味強度を０にする。これは、興味強度を０から１の範囲の値にするために行う。
【０１０２】
本実施の形態では、図３のステップ１０４において閾値の増減に対して要約映像の時間長が連続的に増減するので、第１の実施の形態と比べて要約映像を任意の長さに設定できるという特徴がある。なぜなら、第１の実施の形態では、図４のような階段状の興味強度関数になるため、閾値を連続的に増減させても生成される要約映像の時間長は不連続にしか変化しないのに対し、本実施の形態では、興味強度関数に連続的に変化する区間があるため、閾値の増減に対して要約映像の時間長が連続的に変化するからである。但し、本実施の形態では、要約映像の一部で、ＩＮ点、ＯＵＴ点が、映像のカット構成等の面で中途半端な時点に設定される可能性が高くなるという問題がある。
【０１０３】
上記の第２の実施の形態において、図１３のステップ７０２により、興味強度が図１１に示すように、時間方向に連続的な変化が生まれるので、ステップ１０４の重要シーン抽出処理における閾値の増減に対して要約映像の時間長も同様に連続的に増減する（但し、一部で不連続に変化する）。
【０１０４】
［第３の実施の形態］
本実施の形態では、第１の実施の形態において、映像インデクスと視聴行動のインデクスとの間で、区間が一致するものがある場合に、興味強度関数設定におけるｄ値を大きくすることにより、当該区間が要約映像の１シーンとなりやすくするものである。例えば、映像インデクスの音楽区間と、視聴行動のインデクスの発声区間が一致する場合は、当該音楽区間でユーザが音楽に合わせて歌唱したと考えられるので、当該視聴行動に関する興味強度をｄ値を大きくすることによって当該音楽区間に局在させる。これにより、当該区間が要約映像の１シーンとなりやすくなる。
【０１０５】
図１４は、本発明の第３の実施の形態における動作のフローチャートである。
同図において、図３のフローチャートと同一動作については、同一ステップ番号を付し、その説明を省略する。
【０１０６】
ステップ１０１〜ステップ１０２は、第１の実施の形態と同様である。
【０１０７】
ステップ８０３）ステップ１０１で得た映像インデクスと、ステップ１０２で得た視聴行動インデクスとの間で区間が一致するものを検出する。区間が一致するとは、区間の開始・終了時刻がある誤差の範囲内で一致することを意味する。
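The section matching of step 803 compares start and end times within a tolerance. A minimal sketch, assuming each index is a `(start_sec, end_sec)` pair and using hypothetical function names:

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def intervals_match(a: Interval, b: Interval, tolerance: float) -> bool:
    """True when both the start times and the end times agree within tolerance."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def matching_pairs(video_index: Sequence[Interval], behavior_index: Sequence[Interval],
                   tolerance: float) -> List[Tuple[Interval, Interval]]:
    """All (video section, behavior section) pairs that coincide within tolerance."""
    return [(v, b) for v in video_index for b in behavior_index
            if intervals_match(v, b, tolerance)]
```

For each pair found, the decrement applied at the start and end of that section would then be replaced by the larger value described in step 805.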
【０１０８】
ステップ８０４）ステップ８０３で一致する区間が検出されたかどうかを判定する。一致区間が有る場合は、ステップ８０５に移行し、一致区間がない場合は、ステップ１０３に移行する。
【０１０９】
ステップ８０５）ｄ値を表４のようなｄ’のものに更新する。
【０１１０】
【表４】

ｄ’は、ｄと比べて大きい値を予め設定しておく。また、ｄ値に一定の値を加算してもよい。また、ｄ値に一定の値を掛けてもよい。また、前述の加算値、掛ける値を、インデクスの種別と視聴行動の種別に応じて別々に設定してもよい。図１５は、本発明の第３の実施の形態で生成される興味強度関数の例である。
【０１１１】
ステップ１０３〜ステップ１０６は、第１の実施の形態と同様である。
【０１１２】
上記の第３の実施の形態において、図１４のステップ８０３〜ステップ８０５により、映像インデクスと視聴インデクスとで、区間が一致するものを検出し（ステップ８０３）、当該区間の開始点、終了点における興味強度関数の減算値を、通常の値よりも大きくする（ステップ８０５）により、図１５に示すように、興味強度関数の形状が高層ビル型になり、ステップ１０４で重要シーンを抽出する際に、当該区間がそのまま抽出される可能性が大きくなる。
【０１１３】
［第４の実施の形態］
本実施の形態は、前述の第１の実施の形態において、映像インデクスで、区間長、特徴量が一致する映像区間群が存在する場合に、当該映像区間群に同期する区間群で複数の視聴行動（Ａ）が存在する場合に、任意の一つの視聴行動（Ｂ）を除く視聴行動に関する興味強度を小さくして、要約映像のシーンとして採用されにくくし、また、映像への視聴行動合成時に、（Ｂ）が元となって切り出された映像区間に複数の視聴行動（Ａ）を合成するものである。
【０１１４】
例として、映像中に同一のＢＧＭが２つの区間で演奏され、ユーザが当該２つの区間でＢＧＭに合わせて歌唱した場合、片方の映像区間は要約映像には用いないが、もう片方の区間へ、２つの歌唱を合成する。これにより、要約映像では１つのＢＧＭ区間で、２つの歌唱が重複して合成されたものが得られる。より一般的には、１つのシーンに、複数の視聴行動が合成された映像が得られる。これにより、要約映像の１つのシーンで、複数の視聴行動が視聴できる。ユーザは、当該シーンに関するより多くの付加情報を得られるという特徴がある。
【０１１５】
図１６は、本発明の第４の実施の形態における動作のフローチャートである。同図において、図３のフローチャートと同様の動作には、同一ステップ番号を付し、その説明を省略する。
【０１１６】
ステップ１０１〜ステップ１０２は第１の実施の形態と同様である。
【０１１７】
ステップ１１０３）映像インデクス中の一致インデクスを検出する。全映像インデクスのすべての２つの組み合わせについて、時間長と各種視聴行動の強度値との類似度を計算し、類似度が一定の閾値以上である場合には、一致インデクスとして検出する。また、視聴行動の時間長や強度値の他の任意の特徴量を用いてもよい。また、類似度計算の手法は、時間長等の差分を計算し、逆数をとることなどの方法がある。
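Step 1103 compares every pair of video indexes and treats sufficiently similar ones as matching. The sketch below follows the "reciprocal of the difference" idea but uses `1 / (1 + difference)` so that identical indexes do not divide by zero. That adjustment, the two-element feature tuple, and the function names are assumptions.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Feature = Tuple[float, float]  # assumed: (duration_sec, feature or intensity value)


def similarity(a: Feature, b: Feature) -> float:
    """Reciprocal-style similarity in (0, 1]; equals 1.0 for identical features."""
    diff = abs(a[0] - b[0]) + abs(a[1] - b[1])
    return 1.0 / (1.0 + diff)


def matching_indices(indices: Sequence[Feature], threshold: float) -> List[Tuple[int, int]]:
    """Positions (i, j), i < j, of index pairs whose similarity is >= threshold."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(indices), 2)
            if similarity(a, b) >= threshold]
```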
【０１１８】
ステップ１１０４）ステップ１１０３で一致インデクスが検出されたかどうかを判定する。検出された場合はステップ１１０５に移行する。検出されなかった場合は、ステップ１０３に移行する。
【０１１９】
ステップ１１０５）ステップ１１０３で検出した一致インデクスを保存する。
【０１２０】
ステップ１０３は、第１の実施の形態と同様である。
【０１２１】
ステップ１１０７）一致インデクス区間の興味強度抑制処理を行う。興味強度関数で、ステップ１１０５において保存したインデクスに含まれるインデクスが元となって生成されたものに関して、図１７のように興味強度関数に一定の値を掛けて興味強度を小さくする。その際に掛ける値は適当に設定する。また、視聴行動の種別に応じて、掛ける値を別々に設定してもよい。但し、一致インデクス群のうち、任意の一つが元になって生成された興味強度関数に関しては値はそのままとする。
【０１２２】
ステップ１０４は、第１の実施の形態と同様である。
【０１２３】
ステップ１１０９）抽出区間に視聴行動合成処理を行う。詳細は後述する。
ステップ１０６は、第１の実施の形態と同様である。
【０１２４】
以下にステップ１１０９の抽出区間に視聴行動を合成する処理について説明する。
【０１２５】
図１８は、本発明の第４の実施の形態における視聴行動合成処理のフローチャートであり、図１９は、本発明の第４の実施の形態における視聴行動合成処理を説明するための図である。
【０１２６】
図１８において、前述の図１０に示すフローチャートと同一動作については、同一ステップ番号を付し、その説明を省略する。
【０１２７】
ステップ４０１〜ステップ４０２は、第１の実施の形態と同様である。
【０１２８】
ステップ１２０３）ステップ４０２で抽出した区間が一致インデクスかどうかを判定する。当該区間が図１６のステップ１１０５で保存したものに含まれる場合は、一致インデクスであり、含まれない場合は一致インデクスでない。
【０１２９】
ステップ１２０４）視聴行動を映像へ合成する際のＩＮ点とＯＵＴ点を変換する。図１９のように、興味強度が抑制されていない興味強度関数の元となった映像インデクスの区間の先頭を合成時のＩＮ点とする。また、当該ＩＮ点に視聴行動の区間長を加算した点を合成時のＯＵＴ点とする。
【０１３０】
ステップ４０３〜ステップ４０４は、第１の実施の形態と同様である。
【０１３１】
上記の第４の実施の形態の図１６のステップ１１０３〜ステップ１１０５、１１０７、１１０９により、映像インデクスの中で、区間／特徴量が一致する組み合わせを検出し（Ａ）（ステップ１１０３）、その中の任意の１つのインデクス（Ｂ）を除くものの興味強度関数を図１７に示すように小さくする（ステップ１１０７）。これにより、ステップ１０４で重要シーンを抽出する際に、（Ｂ）のみが重要シーンとして抽出される可能性が大きくなる。次に、ステップ１１０９で、図１９のように、（Ａ）の区間で行われた視聴行動を（Ｂ）の区間に重複させて合成することができる。
【０１３２】
なお、上記の第１〜第４の実施の形態における図３、図６、図８、図１０、図１２、図１３、図１４、図１６、図１８に示すフローチャートをプログラムとして構築し、映像要約装置として利用されるコンピュータにインストールすることも可能である。また、ネットワークを介して流通させることも可能である。
【０１３３】
さらに、構築されたプログラムを、映像要約装置として利用されるコンピュータのハードディスク装置や、フレキシブルディスクやＣＤ−ＲＯＭ等の可搬記憶媒体に格納しておき、本発明を実施する際に、当該装置にインストールして実行することも可能である。
【０１３４】
なお、本発明は、上記の実施の形態に限定されることなく、特許請求の範囲内において、種々変更・応用が可能である。
【０１３５】
【発明の効果】
上述のように、本発明によれば、ユーザの視聴行動に基づいた要約映像を生成することが可能になる。
【０１３６】
本発明では、要約映像中の各重要シーンが、映像カット点や音声区間開始・終了点等映像構成の面で切れ目の自然な要約映像ができる。ここで、要約映像の各重要シーンのつなぎ目が、映像の切れ目と一致していることが自然な要約映像であると考える。例えば、ダンスの一連の動きを映したひと続きのシーンが映像中にあって、当該シーンが重要シーンとして抽出される場合、当該シーンの途中にＩＮ点、ＯＵＴ点が設定されることは無く、当該シーンのＩＮ点、ＯＵＴ点や、その前後のカット点等が重要シーンのＩＮ点、ＯＵＴ点となり、区切れの自然な要約映像を生成できる。
【０１３７】
また、本発明によれば、ユーザが興味強度の度合を指定して要約映像の構成を変えることができる。また、要約映像の時間長を大まかに指定して要約映像を生成できる。
【０１３８】
さらに、興味強度の閾値の増減が可能になることにより、要約映像の時間長をより高精度に設定できる。
【０１３９】
また、映像から切り出された重要シーンに、そのシーンにおいて時間的に同期して行われた視聴行動を当該シーンに合成することにより、要約映像で、元の映像と視聴行動とを同時に視聴できる。例えば、ある観光地の紹介のシーンに「去年の夏に行ったなあ」という発声を付加することで、後の視聴時での記憶想起の支援となり、また、友人等へ送った場合は、当該シーンのコメントとしての役割を果たす。
【０１４０】
また、映像インデクスと視聴インデクスとで区間が一致するものを検出し、当該区間の開始点、終了点における興味強度関数の減算値を通常の値よりも大きくすることにより、ユーザが映像のカット構成、音楽区間等に対して区間の点で意図的な視聴行動を行った場合に、当該意図を反映した重要シーンから構成される要約映像ができる。例えば、映像中で演奏された楽曲に合わせてユーザが歌唱した場合、当該歌唱区間が要約映像の１シーンとなる可能性が高い（当該区間の前後に余計なシーンが付加されることがない）。
【０１４１】
また、区間／特徴量が一致する組み合わせを検出することにより、映像にアナウンスやＢＧＭ等の面で冗長な内容がある場合、当該冗長な内容がない要約映像を生成でき、例えば、映像中で同一のＢＧＭが２ヵ所で演奏され、ユーザが両方の区間で歌唱した場合に、要約映像では１つのＢＧＭ区間で、２つの歌唱が重複して合成されたものが得られ、元来２つのシーンで行った視聴行動を、要約映像では１つのシーンでまとめて視聴できる。
【図面の簡単な説明】
【図１】本発明の原理を説明するための図である。
【図２】本発明の一実施の形態におけるシステム構成図である。
【図３】本発明の第１の実施の形態における動作のフローチャートである。
【図４】本発明の第１の実施の形態における興味強度関数の例である。
【図５】本発明の第１の実施の形態における閾値以上の興味強度関数の例である。
【図６】本発明の第１の実施の形態における視聴行動インデクシング処理のフローチャートである。
【図７】本発明の第１の実施の形態における音声強度区間を求める例である。
【図８】本発明の第１の実施の形態における興味強度関数設定処理のフローチャートである。
【図９】本発明の第１の実施の形態における興味強度関数設定処理を説明するための図である。
【図１０】本発明の第１の実施の形態における抽出区間に視聴行動合成する処理のフローチャートである。
【図１１】本発明の第２の実施の形態における興味強度関数設定処理を説明するための図である。
【図１２】本発明の第２の実施の形態における興味強度設定処理のフローチャートである。
【図１３】本発明の第２の実施の形態における興味強度関数設定処理のフローチャートである。
【図１４】本発明の第３の実施の形態における動作のフローチャートである。
【図１５】本発明の第３の実施の形態で生成される興味強度関数の例である。
【図１６】本発明の第４の実施の形態における動作のフローチャートである。
【図１７】本発明の第４の実施の形態における興味強度関数設定処理を説明するための図である。
【図１８】本発明の第４の実施の形態における視聴行動合成処理のフローチャートである。
【図１９】本発明の第４の実施の形態における視聴行動合成処理を説明するための図である。
【符号の説明】
５０１モニタ
５０２マイク
５０３カメラ
５０４鍵盤
５０５押しボタン
５０６計算機

[0001]
[Technical field to which the invention belongs]
The present invention relates to a video summarizing method and a program, and a storage medium storing a video summarizing program, and more particularly to a video summarizing method and a program for creating a summary from a video, and a storage medium storing a video summarizing program.
[0002]
[Prior art]
Today, personal computers and hard disk video recorders for general users come with non-linear editing functions, and it has become possible for individuals to create summary videos of footage they have shot themselves and of program videos.
[0003]
However, general video editing software is cumbersome to operate and difficult for general users to handle. Techniques have therefore been proposed in which the user performs an operation such as pressing a button at a favorite scene while shooting a video or watching a program video, and a summary video is generated automatically from the operation history.
[0004]
As a first conventional method, the user presses a push button at a favorite scene while watching a video, and a preset offset value is added to or subtracted from that point in time to obtain the IN point, the OUT point, and so on, thereby editing the favorite scene (for example, see Patent Document 1).
[0005]
As a second conventional method, the importance of each scene is calculated from the physical characteristics of that scene, and the scenes whose importance exceeds a certain threshold are joined together to obtain a summary video (for example, see Patent Document 2).
[0006]
[Patent Document 1]
JP-A-2001-57660
[Patent Document 2]
JP-A-2001-119649
[0007]
[Problems to be solved by the invention]
However, the first conventional method described above has the problem that the IN point and the OUT point may be set at awkward positions with respect to the cut structure of the video. For example, the middle of a continuous scene showing a series of dance movements may be set as the IN point, in which case the series of movements cannot be followed in the summary video.
[0008]
Further, with the above technique the length of the video is determined by the number of push-button operations by the user and by the offset values used to set the IN and OUT points, so once a summary video has been created it cannot be remade into a summary video of an arbitrary length.
[0009]
Further, there is a problem that the operation of pressing the push button in a scene in which the user is interested while watching the video is troublesome, and the user may be absorbed in the video and forget to press the push button.
[0010]
Further, the second conventional method has the problem that, when another person views the summary video, the content is hard to grasp because the video is fragmented. For example, in a drama program, viewing a single scene does not reveal the story development leading up to it, so the content cannot be understood. In addition, the reason each scene was selected is unclear. For example, if a summary video containing a scene that shows a favorite car is sent to a friend as a recommended video and the friend watches it, the friend cannot tell whether it is the car or something else shown in the scene that is being recommended.
[0011]
The present invention has been made in view of the above points. A computer analyzes the viewer's natural viewing behavior (utterances, gestures, and the like) while the viewer is watching a video and extracts the video sections in which the viewer was interested. The object is to provide a video summarizing method, a program, and a storage medium storing a video summarizing program with which (1) the IN and OUT points of the summary video are rarely set at awkward positions with respect to the cut structure of the video, (2) a summary video of arbitrary length can be created, and (3) the content of the original video and the reason each scene was selected can be understood from the summary video alone.
[0012]
[Means for Solving the Problems]
FIG. 1 is a diagram for explaining the principle of the present invention.
[0013]
The present invention is a video summarizing method for creating a summary from a video, comprising:
a change point extraction process (step 1) of extracting a video index from change points of the input video, including cut points;
a viewing behavior extraction process (step 2) of extracting a viewing behavior index, by any or all of utterance section extraction, body movement section extraction, and performance section extraction, from the viewing behavior performed by the user during video viewing;
an interest intensity function setting process (step 3) of setting an interest intensity function that represents the user's degree of interest at each point in time of the video;
an interest intensity section extraction process (step 4) of extracting, from the sections extracted by the viewing behavior extraction process, the sections whose interest intensity is at or above a predetermined threshold; and
a playback process (step 5) of continuously playing back the sections extracted by the interest intensity section extraction process.
[0014]
Further, in the present invention, the viewing behavior extraction process (step 2) includes a viewing behavior indexing process that measures the intensity, the length, or both of the viewing behavior, and
the interest intensity function setting process (step 3) includes a process of setting the interest intensity function as a continuous value according to the intensity/length of the viewing behavior and changing the interest intensity at the change points of the video index.
[0015]
Further, the present invention includes a process of extracting a section whose interest intensity is at or above a predetermined threshold and synthesizing the viewing behavior into that section.
[0016]
The present invention also includes a process of extracting indexes whose sections match between the video index extracted in the change point extraction process (step 1) and the viewing behavior index extracted in the viewing behavior extraction process (step 2), and
a process of increasing the amount of change in the interest intensity at the start point and end point of the matching index.
[0017]
The present invention also includes a process of extracting, from among the video indexes extracted in the change point extraction process (step 1), indexes whose sections and feature values match, and
a process of extracting a section whose interest intensity is at or above a predetermined threshold and synthesizing into that section all the viewing behaviors corresponding to the indexes whose sections and feature values match.
[0018]
The present invention is a video summarizing program for creating a summary from a video, which executes:
a change point extraction step of extracting a video index from change points of the input video, including cut points;
a viewing behavior extraction step of extracting a viewing behavior index, by any or all of utterance section extraction, body movement section extraction, and performance section extraction, by measuring the intensity, the length, or both of the viewing behavior performed by the user during video viewing;
an interest intensity function setting step of setting the interest intensity function as a continuous value according to the intensity/length of the viewing behavior and changing the interest intensity at the change points of the video index;
a section matching index extraction step of extracting indexes whose sections match between the video index extracted in the change point extraction step and the viewing behavior index extracted in the viewing behavior extraction step;
a change amount increasing step of increasing the amount of change in the interest intensity at the start point and end point of the index;
a feature matching index extraction step of extracting, from among the video indexes extracted in the change point extraction step, indexes whose sections and feature values match;
a step of extracting a section whose interest intensity is at or above a predetermined threshold and synthesizing into that section all the viewing behaviors corresponding to the indexes whose sections and feature values match; and
a playback step of synthesizing into each extracted video section the viewing behavior of that section and playing the sections back continuously.
[0019]
The present invention is a storage medium storing a video summarizing program for creating a summary from a video, the stored program comprising:
a change point extraction step of extracting a video index from change points of the input video, including cut points;
a viewing behavior extraction step of extracting a viewing behavior index, by any or all of utterance section extraction, body movement section extraction, and performance section extraction, by measuring the intensity, the length, or both of the viewing behavior performed by the user during video viewing;
an interest intensity function setting step of setting the interest intensity function as a continuous value according to the intensity/length of the viewing behavior and changing the interest intensity at the change points of the video index;
a section matching index extraction step of extracting indexes whose sections match between the video index extracted in the change point extraction step and the viewing behavior index extracted in the viewing behavior extraction step;
a change amount increasing step of increasing the amount of change in the interest intensity at the start point and end point of the index;
a feature matching index extraction step of extracting, from among the video indexes extracted in the change point extraction step, indexes whose sections and feature values match;
a step of extracting a section whose interest intensity is at or above a predetermined threshold and synthesizing into that section all the viewing behaviors corresponding to the indexes whose sections and feature values match; and
a playback step of synthesizing into each extracted video section the viewing behavior of that section and playing the sections back continuously.
[0020]
As described above, the present invention extracts cut points and the like from the video, indexes viewing behavior such as utterances, sets a function that changes the interest intensity at the change points of the video index, extracts the sections whose interest intensity is at or above the threshold, and plays the extracted sections back continuously. As a result, the IN and OUT points of the summary video are less likely to be set at awkward positions with respect to the cut structure of the video.
[0021]
Also, the present invention extracts cut points and the like from the video, measures the intensity, the length, or both of viewing behavior such as utterances, sets the interest intensity function as a continuous value according to the intensity/length of the viewing behavior so that the interest intensity changes at the change points of the video index, extracts the sections whose interest intensity is at or above the threshold, and plays the extracted sections back continuously. This makes it possible to create a summary video of arbitrary length.
[0022]
Further, the present invention extracts cut points and the like from the video, indexes viewing behavior such as utterances, sets an interest intensity function that changes the interest intensity at the change points of the video index, extracts the sections whose interest intensity is at or above the threshold, synthesizes the viewing behavior into the extracted sections, and plays the extracted sections back continuously. This makes it possible to grasp the content of the original video and to understand why each scene was selected from the summary video alone.
[0023]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0024]
FIG. 2 shows a system configuration according to an embodiment of the present invention. The system shown in the figure includes a monitor 501, a microphone 502, a camera 503, a keyboard 504, a push button 505, and a computer 506.
[0025]
The monitor 501 displays a program video and the like, and the user views it.
[0026]
The microphone 502 records the utterance performed by the user while viewing the video. For example, an utterance "great" is recorded in a fine play scene of a sports program.
[0027]
The camera 503 records body motions of the user such as gestures. For example, it records the user making a triumphant fist pump (a "guts pose") at the moment the user's favorite player wins in a sports program.
[0028]
The keyboard 504 records the user's musical performance. For example, in a documentary program, it records a celebratory performance played by the user in a scene where a person appearing in the program achieves some success.
[0029]
The push button 505 records the user's push button operations. In addition to the binary ON/OFF information, the push button 505 can record the pressure with which the button is pressed and how long it is held down. For example, in a cooking program, it records a push button operation performed by the user in a scene where a flip board listing the ingredients is presented.
[0030]
The computer 506 transmits video to the monitor 501, receives signals from the microphone 502, the camera 503, and the keyboard 504, and performs various processes.
[0031]
In this system, the user's various viewing behaviors (utterances, body motions, etc.) are acquired with the microphone and other devices, and a summary video is generated by treating the video sections in which a viewing behavior occurred as important scenes.
[0032]
The following can be considered as the viewing behavior.
[0033]
・ Voice:
Example: "Awesome", "It looks good", "It's good", Singing
・ Body movement:
Example: Guts pose, wink, jump
・ Playing instruments:
Example: Guitar improvisation, "Kimigayo" piano performance
・ Push button down
Among these viewing behaviors, “vocalization”, “physical movement”, and “musical instrument playing” are actions that the user performs naturally while watching the video. They impose a lower mental burden on the user than “pressing a push button” and are therefore desirable as viewing behaviors to acquire for generating a summary video.
[0034]
[First Embodiment]
FIG. 3 is a flowchart of the operation according to the first embodiment of the present invention.
[0035]
Step 101) Indexing processing of the input video, such as cut point extraction, audio section extraction, and music section extraction, is performed. For this processing, the methods described in, for example, [Yukinobu Taniguchi, Yoshinobu Tonomura, Hiroshi Hamada: Video Shot Switching Detection Method and Its Application to Video Access Interface, IEICE Transactions, Vol. J79-D2, No. 4, pp. 538-546] and [Kenichi Minami, Akito Akutsu, Hiroshi Hamada, Yoshinobu Tonomura: Video Indexing Using Sound Information and Its Applications, IEICE Transactions, Vol. J81-D2, No. 3, pp. 529-537] can be used. The indexes described here are examples that can be detected by video processing. However, breaks between news items or between scenes of a drama may also be detected from subtitle broadcasting information and used as indexes.
[0036]
Step 102) Indexing processing, such as utterance section extraction, body motion section extraction, and performance section extraction, is performed on the utterances, body motions, performances, and the like that the user made while watching the video. Details will be described later.
[0037]
Further, the order of step 101 and step 102 may be reversed. Step 101 and step 102 may be performed simultaneously.
[0038]
Step 103) An interest intensity function as shown in FIG. 4 is set. The interest intensity function represents the user's degree of interest at each point in time of the video and takes a value from 0 to 1, where a larger value indicates higher interest. The method of estimating the interest intensity function will be described later.
[0039]
Step 104) Sections (scenes 1 to 3) in which the interest intensity is equal to or greater than a certain threshold, as shown in FIG. 5, are extracted. The threshold is set arbitrarily by the user. Alternatively, the user may specify the desired time length of the summary video, and the threshold may be determined from that time length.
[0040]
Step 105) The viewing action is combined with the video of the video section extracted in Step 104. Details will be described later.
[0041]
Step 106) The section extracted in Step 104 is continuously reproduced. This is the summary video. Alternatively, a summary video may be created as one video file by connecting videos in the extraction section. Further, a summary video may be realized as a list of time information of the extraction section.
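
The threshold extraction of step 104 can be sketched as follows. This is a minimal illustration, assuming the interest intensity function is sampled at a fixed interval `dt`; the function names, the sampling scheme, and the way a threshold is chosen from a target length are assumptions for illustration and are not taken from the specification.

```python
from typing import List, Optional, Tuple


def extract_sections(intensity: List[float], threshold: float,
                     dt: float = 1.0) -> List[Tuple[float, float]]:
    """Return (start, end) times of runs where intensity >= threshold."""
    sections = []
    start = None
    for i, value in enumerate(intensity):
        if value >= threshold and start is None:
            start = i                      # a section opens here
        elif value < threshold and start is not None:
            sections.append((start * dt, i * dt))
            start = None
    if start is not None:                  # section runs to the end of the video
        sections.append((start * dt, len(intensity) * dt))
    return sections


def threshold_for_length(intensity: List[float], target: float,
                         dt: float = 1.0) -> Optional[float]:
    """Lowest sampled intensity value whose summary is no longer than target.

    Hypothetical helper: the specification only says the threshold may be
    determined from the desired summary length, not how.
    """
    for candidate in sorted(set(intensity)):
        total = sum(end - start
                    for start, end in extract_sections(intensity, candidate, dt))
        if total <= target:
            return candidate
    return None                            # even the highest level is too long
```

The returned list of (start, end) pairs corresponds to the form of summary mentioned in step 106, in which the summary video is held as a list of time information for the extracted sections.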
[0042]
Next, the viewing behavior indexing process in step 102 will be described in detail.
[0043]
FIG. 6 is a flowchart of the viewing behavior indexing process according to the first embodiment of the present invention.
[0044]
Step 201) The type of viewing behavior (utterance, body motion, performance, button press, etc.) is determined. The signals obtained from the microphone 502, the camera 503, the keyboard 504, the push button 505, and the like are stored as separate files in the computer 506, so the type of viewing behavior is determined by which file is being analyzed. Alternatively, the signals obtained from the microphone 502, the camera 503, the keyboard 504, the push button 505, and the like may be stored in separate tracks of the same file, in which case the type is determined by which track is being analyzed.
[0045]
Step 202) It is determined whether the type determined in step 201 is utterance. If it is an utterance, the process proceeds to step 203. If not, the process proceeds to step 205.
[0046]
Step 203) The strength of the uttered voice signal is calculated based on the power value of the voice signal. The strength may also be calculated from the power of the voice signal, the utterance measure of the voice, the volume of the voice, and the like.
[0047]
Step 204) A time section in which the voice strength obtained in step 203 is equal to or greater than a predetermined threshold is extracted (FIG. 7) and defined as an utterance section (L). This method is described in [Kenichi Minami, Akito Akutsu, Hiroshi Hamada, Yoshinobu Tonomura: Video Indexing Using Sound Information and Its Applications, IEICE Transactions, Vol. J81-D2, No. 3, pp. 529-537] and the like. Here, a section may be treated as an utterance section only when its time length is longer than a certain time.
[0048]
Step 205) It is determined whether the type determined in step 201 is a physical movement. If it is a body movement, the process proceeds to step 206. If it is not a body movement, the process proceeds to step 208.
[0049]
Step 206) The strength of the body motion is calculated. The inter-frame difference value of the video captured by the camera 503 is used as the strength of the body motion. The inter-frame difference is calculated with the method used in cut point detection [Yukinobu Taniguchi, Yoshinobu Tonomura, Hiroshi Hamada: Video Shot Switching Detection Method and Its Application to Video Access Interface, IEICE Transactions, Vol. J79-D2, No. 4, pp. 538-546].
[0050]
Step 207) A time section in which the body motion strength obtained in step 206 is equal to or greater than a predetermined threshold is defined as a body motion section. Here, a section may be treated as a body motion section only when its time length is equal to or longer than a certain time.
[0051]
Step 208) It is determined whether the type determined in step 201 is performance. If it is a performance, the process proceeds to step 209; otherwise, the process proceeds to step 211.
[0052]
Step 209) The performance intensity is calculated. The power of the sound signal of the performance is calculated, and is set as the performance intensity. Alternatively, the strength calculation may be performed based on the performance tempo, keying speed, and the like.
[0053]
Step 210) A time section in which the performance intensity obtained in step 209 is equal to or greater than a certain threshold is extracted and set as a performance section. Here, a performance section may be determined only when the time length of the performance section is equal to or longer than a predetermined time.
[0054]
Step 211) It is determined whether the type determined in step 201 is a button press. If it is a button press, the process proceeds to step 212. If it is not a button press, the process proceeds to step 216.
[0055]
Step 212) The strength of the button press is calculated. The pressure with which the button is pressed is obtained and used as the press strength. The strength may also be calculated from the detected pressing speed or from detection of rapid repeated presses. For example, when the button is pressed repeatedly in rapid succession, the strength value may be set higher than for a normal press.
[0056]
Step 213) A time section in which the button press strength obtained in step 212 is equal to or greater than a certain threshold is extracted and set as a button press section.
[0057]
Step 214) The section extracted in step 204, step 207, step 210, or step 213 is saved as a list of viewing action sections for later processing. The viewing action section can be expressed as, for example, a set of a section start time / end time and a viewing action type.
[0058]
Step 215) It is determined whether or not the processing has been completed for all viewing behaviors. If the processing has been completed, the processing ends. If it has not been completed, the process returns to step 201 and another viewing behavior is processed.
[0059]
Step 216) Notify the user of the error and end the process. Further, when handling viewing behavior other than utterance, body movement, performance, and button press, the same processing as utterance or the like may be performed on the action without performing error notification.
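
The common pattern in steps 203 to 214 (compute a strength signal, keep the runs at or above a threshold, optionally discard short runs, and store each run with its behavior type) can be sketched as follows. This is an illustrative sketch only: the frame-based power calculation, the function names, and the tuple layout are assumptions, not the method of the cited papers.

```python
from typing import List, Tuple


def frame_power(samples: List[float], frame_size: int) -> List[float]:
    """Mean squared amplitude of each full frame of a signal."""
    return [sum(x * x for x in samples[i:i + frame_size]) / frame_size
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def behavior_sections(strength: List[float], threshold: float, kind: str,
                      dt: float = 1.0,
                      min_length: float = 0.0) -> List[Tuple[float, float, str]]:
    """Return (start, end, kind) for runs where strength >= threshold.

    Runs shorter than min_length are dropped, mirroring the optional
    minimum-duration check described for each behavior type.
    """
    sections = []
    start = None
    # A trailing sentinel below any threshold closes a run that reaches the end.
    for i, value in enumerate(list(strength) + [float("-inf")]):
        if value >= threshold and start is None:
            start = i
        elif value < threshold and start is not None:
            if (i - start) * dt >= min_length:
                sections.append((start * dt, i * dt, kind))
            start = None
    return sections
```

Each returned tuple has the shape suggested in step 214: a section start time, an end time, and a viewing behavior type.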
[0060]
Next, the interest intensity function setting process in step 103 will be described in detail.
[0061]
FIG. 8 is a flowchart of the interest intensity function setting process according to the first embodiment of the present invention.
[0062]
Hereinafter, the processing from step 301 to step 309 is repeated for all viewing action sections.
[0063]
Step 301) One viewing behavior is selected, and its type (utterance, body motion, performance, push button, etc.) is determined. The signals obtained from the microphone 502, the camera 503, the keyboard 504, the push button 505, and the like are stored as separate files in the computer 506, so the type of viewing behavior is determined by which file is being analyzed. Alternatively, the signals obtained from the microphone 502, the camera 503, the keyboard 504, the push button 505, and the like may be stored in separate tracks of the same file, in which case the type is determined by which track is being analyzed.
[0064]
Step 302) The H value indicating the maximum value of the interest intensity is calculated by the following equation.
[0065]
(Equation 1)

The H value is a value obtained by estimating the degree of interest of the user from the recording of the viewing behavior, and takes a value from 0 to 1, and a larger value indicates a higher interest.
[0066]
Here, the ck value is determined for each viewing behavior type as shown in Table 1 and expresses the relative strength of association between that viewing behavior type and the degree of interest.
[0067]
[Table 1]

For example, the values shown in Table 1 indicate that the relationship between the degree of interest and the utterance is stronger than the relationship with the gesture.
[0068]
The L value is the time length of the viewing action section (FIG. 7). As the time of the viewing behavior is longer, the H value increases, and the interest intensity increases. For example, when the user makes a gesture for a long time, the interest level is high.
[0069]
The P value is a time average of the strength of various viewing behaviors such as the strength of the audio signal and the strength of the body movement (FIG. 7). As the strength of various viewing behaviors increases, the H value increases, and the interest strength increases. For example, when the user utters a loud voice, the interest level is high. The C value is an appropriate constant. The reason why the sigmoid function is used in the calculation of the H value is to make the value range from 0 to 1. Also, any function that returns a value from 0 to 1 may be used.
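
The behavior of the H value described above can be illustrated with a short sketch. Equation 1 itself is not reproduced in this text, so the argument passed to the sigmoid below (`c_k * length * mean_strength - offset`) is purely an assumed form chosen to match the stated properties: H lies between 0 and 1 and grows with the ck, L, and P values, with C acting as a constant. It should not be read as the actual equation.

```python
import math


def interest_peak(c_k: float, length: float, mean_strength: float,
                  offset: float) -> float:
    """Hypothetical H value: a sigmoid of an assumed combination of ck, L, P, C.

    ASSUMPTION: the form c_k * L * P - C is illustrative only; the
    specification's Equation 1 is not available in this text.
    """
    x = c_k * length * mean_strength - offset
    return 1.0 / (1.0 + math.exp(-x))
```

With this assumed form, a longer or stronger viewing behavior yields a larger H, and the sigmoid keeps the result strictly between 0 and 1.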
[0070]
The following steps 303 to 309 will be described with reference to FIG.
[0071]
Step 303) Set the interest level of the section from the point in time t seconds back from the start point of the viewing action section to the end point of the viewing action section as H. Here, the t value is determined according to each viewing behavior type as shown in Table 2.
[0072]
[Table 2]

The t value represents a time delay from the appearance of an interesting scene in the video until the user takes a viewing action. For example, Table 2 shows that the user often utters 0.5 seconds after an interesting scene appears.
[0073]
Step 304) Go back in time from the viewing action section and subtract the d value from the interest level at the index point of the video. Here, the index points of the video represent a cut point of the video, a start point / end point of the audio section, a start point / end point of the music section, and the like. The d value is determined according to each video index and each viewing behavior type as shown in Table 3.
[0074]
[Table 3]

This is because the content of the video often changes at the index point of the video, and it is considered that the degree of interest also changes. Therefore, the interest level is subtracted at the index point of the video. The magnitude of the d value indicates the strength of the association between the type of each index point and the change in the interest level. The greater the d value, the greater the change in interest level at the index point. Here, as shown in FIG. 9, in a section where the interest intensity functions set based on a plurality of viewing behaviors temporally overlap each other, a value having a larger interest level is set as the interest level of the section. Alternatively, the interest intensities may be averaged and used as the interest intensities in the section.
[0075]
Step 305) Determine whether the interest level is 0 or more. If the value is 0 or more, the process proceeds to step 306. If it is not 0 or more, the process proceeds to step 308.
[0076]
Step 306) As in step 304, the d value is subtracted from the interest intensity at the next index point of the video further back in time.
[0077]
Step 307) Determine whether the interest level is 0 or more. If the value is 0 or more, the process returns to step 306. If the value is not 0 or more, the process proceeds to step 308.
[0078]
Step 308) The interest intensity less than 0 is set to 0. This is performed to set the interest intensity to a value in the range of 0 to 1.
[0079]
Step 309) The processing after step 304 is similarly performed for the section temporally after the viewing behavior.
[0080]
Step 310) It is determined whether the setting process of the interest intensity function has been completed for all viewing behaviors. If the processing has been completed, the processing ends. If the processing has not been completed, the process returns to step 301, and the interest intensity function is set for another viewing behavior.
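
Steps 303 to 309 can be sketched on a sampled timeline as follows. This is a simplified illustration under stated assumptions: time is measured in integer sample units, sample i covers the interval from i to i+1, index points lie on sample boundaries, and a single d value is used for every index point, whereas the specification selects d per index type and viewing behavior type from Table 3. All names are hypothetical.

```python
from typing import Iterable, List


def step_interest(n_samples: int, start: int, end: int, peak: float,
                  delay: int, index_points: Iterable[int],
                  d: float) -> List[float]:
    """Step-shaped interest intensity for one viewing behavior section."""
    points = set(index_points)
    values = [0.0] * n_samples
    lo = max(0, start - delay)             # step 303: go back t before the start
    hi = min(n_samples, end)
    for i in range(lo, hi):
        values[i] = peak
    level = peak                           # steps 304 to 308: walk back in time
    for i in range(lo - 1, -1, -1):
        if i + 1 in points:                # crossing an index point
            level -= d
        values[i] = max(level, 0.0)        # negative levels are set to 0
    level = peak                           # step 309: same procedure forward
    for i in range(hi, n_samples):
        if i in points:
            level -= d
        values[i] = max(level, 0.0)
    return values


def combine(a: List[float], b: List[float]) -> List[float]:
    """Where functions from several behaviors overlap, keep the larger value."""
    return [max(x, y) for x, y in zip(a, b)]
```

Because the level only changes when an index point is crossed, the resulting curve is constant between index points, which is what places the IN and OUT points of extracted sections on breaks in the video.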
[0081]
Next, the processing of synthesizing the viewing behavior in the extracted section in step 105 will be described.
[0082]
FIG. 10 is a flowchart of the process of synthesizing a viewing behavior with an extracted section according to the first embodiment of the present invention.
[0083]
Step 401) The IN point and the OUT point of the video section extracted in Step 104 are obtained.
[0084]
Step 402) The recording of the viewing behavior (utterance, gesture, etc.) is cut out at the IN point and the OUT point acquired in step 401. Here, it is assumed that the time codes of the video and of the viewing behavior recording are synchronized. For example, when an utterance made during viewing is to be synthesized, the portion of the utterance recording whose timing matches the extracted video section is cut out.
[0085]
Step 403) The viewing behavior cut out in step 402 is synthesized with the video. When the type of the viewing behavior is “vocalization”, the utterance audio signal is added to the audio signal of the video. Alternatively, the utterance audio signal may be recorded on a separate audio track and mixed when the summary video is played back. Further, to compensate for the time delay of the utterance, the utterance may be shifted earlier in time by the amount shown in Table 2.
[0086]
When the type of the viewing behavior is “body movement”, the recorded body motion is displayed in a small sub-screen within the video screen. Alternatively, the video and the recording of the body motion may be displayed alternately. To compensate for the time delay of the body motion, it may be shifted earlier in time by the amount shown in Table 2.
[0087]
When the type of the viewing behavior is "performance", the performance audio signal is added to the video audio signal and synthesized. Alternatively, the sound signal of the performance may be recorded on another track of the sound, and may be synthesized at the time of reproducing the summary video. Also, taking into account the time delay of the performance, the composition may be advanced earlier by the time shown in Table 2.
[0088]
When the type of the viewing behavior is “push button”, the process proceeds to step 404. Alternatively, a button mark or the like may be composited onto the video. To compensate for the time delay of the push button operation, it may be shifted earlier in time by the amount shown in Table 2.
[0089]
Step 404) It is determined whether or not the processing has been completed for all viewing behaviors. If the processing has been completed, the processing ends. If it has not been completed, the process returns to step 401 and another viewing behavior is processed.
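
For the audio case of steps 401 to 403, the cutout and mixing can be sketched as follows. This is a minimal sketch assuming both recordings are lists of samples on the same time code, that mixing is plain sample-wise addition, and that the delay is given in samples. The function name and parameters are illustrative only.

```python
from typing import List


def mix_utterance(video_audio: List[float], utterance_audio: List[float],
                  in_idx: int, out_idx: int, delay: int = 0) -> List[float]:
    """Mix the utterance recording into the scene [in_idx, out_idx).

    A positive delay shifts the utterance earlier in time, so that a reaction
    recorded `delay` samples after the scene content lines up with it.
    """
    mixed = list(video_audio[in_idx:out_idx])
    for i in range(len(mixed)):
        src = in_idx + i + delay           # utterance sample that lands at i
        if 0 <= src < len(utterance_audio):
            mixed[i] += utterance_audio[src]
    return mixed
```

Keeping the utterance on a separate track, as the text also allows, would amount to storing the cut-out utterance samples and performing this addition at playback time.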
[0090]
As described above, the interest intensity function setting process of step 103 in FIG. 3 changes the interest intensity discontinuously at cut points and the like of the video. As a result, in the process of extracting the sections at or above the threshold (step 104), the IN point and the OUT point of the summary video coincide with breaks in the video, as shown in FIG. 5.
[0091]
Further, through the strength calculation shown in FIG. 6 and the processing of step 104 in FIG. 3, the interest intensity of an important scene is calculated as a continuous value from 0 to 1 based on the viewing behavior strength (calculated from, for example, the loudness and duration of the utterance). The calculation is such that the interest intensity is high when the viewing behavior strength is high and low when the viewing behavior strength is low. Then, as shown in FIG. 5, the time length of the summary video can be varied by taking the sections at or above the threshold as the scenes of the summary video.
[0092]
Further, in step 105 (FIG. 10), a viewing behavior (such as an utterance) performed during an important scene can be synthesized with that scene, cut out from the video, in temporal synchronization with the scene.
[0093]
[Second embodiment]
In this embodiment, the interest intensity function setting of the first embodiment is modified so that, in sections other than index points, the interest intensity is not held at a constant value but is reduced in proportion to the time difference from the viewing behavior, as illustrated in the drawings.
[0094]
The interest strength setting processing in step 103 will be described.
[0095]
FIG. 12 is a flowchart of the interest strength setting process according to the second embodiment of the present invention. In the figure, the same operations as those in FIG. 8 described above are denoted by the same step numbers, and description thereof is omitted.
[0096]
Steps 301 to 302 are the same as those in the first embodiment.
Step 603) Initialize the interest intensity function. Details will be described later.
[0097]
Steps 304 to 310 are the same as in the first embodiment.
Next, the interest intensity function setting process in step 603 will be described.
[0098]
FIG. 13 is a flowchart of the interest intensity function setting process according to the second embodiment of the present invention.
[0099]
Step 701) The interest level of the viewing action section is set to H.
[0100]
Step 702) For sections other than the viewing action section, the interest intensity is reduced in proportion to the time difference from the viewing action section. The slope of the interest intensity is set appropriately and may be set separately for each type of viewing behavior. Alternatively, any function that decreases as the time difference from the viewing action section increases (for example, a Gaussian function) may be used.
[0101]
Step 703) The interest intensity less than 0 is set to 0. This is performed to set the interest intensity to a value in the range of 0 to 1.
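
Steps 701 to 703 can be sketched as follows. This is an illustrative sketch on an integer sample timeline with a single linear slope; the names are hypothetical, and the subsequent index point subtraction of steps 304 to 310 is not included.

```python
from typing import List


def sloped_interest(n_samples: int, start: int, end: int, peak: float,
                    slope: float) -> List[float]:
    """Initial interest intensity that falls off linearly outside [start, end)."""
    values = []
    for i in range(n_samples):
        if start <= i < end:
            gap = 0                        # step 701: inside the section
        elif i < start:
            gap = start - i                # distance to the section start
        else:
            gap = i - (end - 1)            # distance to the last section sample
        # Step 702 reduces in proportion to the gap; step 703 clamps at 0.
        values.append(max(peak - slope * gap, 0.0))
    return values
```

Because this curve has sloped segments, raising or lowering the threshold moves the extracted section boundaries gradually rather than in jumps.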
[0102]
In this embodiment, the time length of the summary video increases and decreases continuously as the threshold in step 104 of FIG. 3 is raised or lowered, so the summary video can be set to an arbitrary length more easily than in the first embodiment. In the first embodiment, the interest intensity function is step-shaped as shown in FIG. 4, so even if the threshold is varied continuously, the time length of the generated summary video changes only discontinuously. In the present embodiment, by contrast, the interest intensity function has sections in which it changes continuously, so the time length of the summary video changes continuously as the threshold is raised or lowered. A drawback of the present embodiment, however, is that some IN points and OUT points of the summary video are more likely to fall at awkward midway points with respect to the cut structure of the video and the like.
[0103]
In the second embodiment described above, step 702 in FIG. 13 makes the interest intensity change continuously in the time direction as shown in FIG. 11. Consequently, as the threshold in the important scene extraction processing of step 104 is raised or lowered, the time length of the summary video also increases or decreases continuously (although it changes discontinuously in places).
[0104]
[Third Embodiment]
In the present embodiment, the first embodiment is modified so that, when a section of the video index coincides with a section of the viewing behavior index, the d value used in setting the interest intensity function is increased, which makes that section more likely to become one scene of the summary video. For example, if a music section of the video index coincides with an utterance section of the viewing behavior index, the user is considered to have sung along with the music in that section. Increasing the d value confines the interest intensity arising from that viewing behavior to the music section, which makes the section more likely to become one scene of the summary video.
[0105]
FIG. 14 is a flowchart of the operation according to the third embodiment of the present invention.
In the figure, the same operations as those in the flowchart of FIG. 3 are denoted by the same step numbers, and description thereof is omitted.
[0106]
Steps 101 to 102 are the same as in the first embodiment.
[0107]
Step 803) Sections that coincide between the video index obtained in step 101 and the viewing behavior index obtained in step 102 are detected. Sections are regarded as coinciding when their start times and end times agree within a certain error range.
[0108]
Step 804) It is determined whether or not a matching section is detected in Step 803. If there is a matching section, the process proceeds to step 805; otherwise, the process proceeds to step 103.
[0109]
Step 805) The d value is updated to d ′ as shown in Table 4.
[0110]
[Table 4]

A value larger than d is set in advance for d′. Alternatively, a constant value may be added to the d value, or the d value may be multiplied by a constant. The added value and the multiplier may be set separately according to the type of index and the type of viewing behavior. FIG. 15 is an example of the interest intensity function generated in the third embodiment of the present invention.
[0111]
Steps 103 to 106 are the same as in the first embodiment.
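
The section matching of step 803 and the d value update of step 805 can be sketched as follows. This is a hedged sketch: sections are assumed to be (start, end) pairs in seconds, and `tolerance`, `add`, and `scale` are illustrative parameters standing in for the "certain error range" and for the added or multiplied constants, whose actual values are not given in this text.

```python
from typing import List, Tuple

Section = Tuple[float, float]


def sections_match(a: Section, b: Section, tolerance: float) -> bool:
    """True when both the start times and the end times agree within tolerance."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance


def find_matches(video_sections: List[Section],
                 behavior_sections: List[Section],
                 tolerance: float) -> List[Tuple[Section, Section]]:
    """All (video section, behavior section) pairs that coincide."""
    return [(v, b) for v in video_sections for b in behavior_sections
            if sections_match(v, b, tolerance)]


def boosted_d(d: float, add: float = 0.0, scale: float = 1.0) -> float:
    """Enlarged subtraction value d' obtained by scaling and/or adding a constant."""
    return d * scale + add
```

Using the enlarged value at the start and end points of a matched section makes the interest intensity drop sharply just outside that section.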
[0112]
In the third embodiment described above, steps 803 to 805 of FIG. 14 detect a video index and a viewing behavior index whose sections coincide (step 803) and make the value subtracted from the interest intensity function at the start and end points of that section larger than the normal value (step 805). The interest intensity function therefore takes a tall, building-like shape, as illustrated in the drawings, so there is a high possibility that the section is extracted as it is.
[0113]
[Fourth Embodiment]
This embodiment differs from the first embodiment as follows. When the video index contains a group of video sections whose section lengths and feature amounts match, and a plurality of viewing behaviors (A) exist in the sections synchronized with that group, the interest intensity is reduced for every viewing behavior except an arbitrary one (B), so that the others are less likely to be adopted as scenes of the summary video. The plurality of viewing behaviors (A) are then combined into the video section cut out on the basis of (B).
[0114]
As an example, if the same BGM is played in two sections of the video and the user sings along with the BGM in both sections, one of the video sections is left out of the summary video and both singing recordings are synthesized with the other section. As a result, the summary video contains a single BGM section on which the two singing recordings are overlaid. More generally, a video in which a plurality of viewing behaviors are combined into one scene is obtained, so a plurality of viewing behaviors can be viewed in one scene of the summary video. This allows the user to obtain more additional information about the scene.
[0115]
FIG. 16 is a flowchart of the operation according to the fourth embodiment of the present invention. In the figure, the same operations as those in the flowchart of FIG. 3 are denoted by the same step numbers, and description thereof is omitted.
[0116]
Steps 101 to 102 are the same as in the first embodiment.
[0117]
Step 1103) Matching indexes within the video index are detected. For every pair of video indexes, the similarity of their time lengths and of the strength values of the various viewing behaviors is calculated, and a pair whose similarity is equal to or greater than a certain threshold is detected as matching indexes. Feature amounts other than the time length and the strength value may also be used. One way to calculate the similarity is to take the difference of a quantity such as the time length and use its reciprocal.
[0118]
Step 1104) It is determined whether or not a matching index is detected in step 1103. If detected, the process proceeds to step 1105. If not detected, the process proceeds to step 103.
[0119]
Step 1105) The matching index detected in step 1103 is stored.
[0120]
Step 103 is the same as in the first embodiment.
[0121]
Step 1107) Interest intensity suppression is performed for the matching index sections. For the interest intensity functions generated from the indexes saved in step 1105, the interest intensity is reduced by multiplying the function by a constant value, as illustrated in the drawings. The multiplier is set appropriately and may be set separately for each type of viewing behavior. However, the interest intensity function generated from any one index of the matching index group is left unchanged.
[0122]
Step 104 is the same as in the first embodiment.
[0123]
Step 1109) Viewing behavior synthesis processing is performed for the extracted section. Details will be described later.
Step 106 is the same as in the first embodiment.
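
The matching index detection of step 1103 and the suppression of step 1107 can be sketched as follows. This is a minimal sketch that compares section lengths only and uses the reciprocal of the difference as the similarity, as the text suggests. The small `eps` term, which avoids division by zero for identical lengths, and all names are assumptions for illustration.

```python
from itertools import combinations
from typing import List, Tuple

Section = Tuple[float, float]


def similarity(len_a: float, len_b: float, eps: float = 1e-6) -> float:
    """Reciprocal of the length difference; eps is an assumed guard term."""
    return 1.0 / (abs(len_a - len_b) + eps)


def matching_pairs(sections: List[Section],
                   threshold: float) -> List[Tuple[int, int]]:
    """Index pairs (i, j), i < j, whose section lengths are similar enough."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(sections), 2):
        if similarity(a[1] - a[0], b[1] - b[0]) >= threshold:
            pairs.append((i, j))
    return pairs


def suppress(intensity: List[float], factor: float) -> List[float]:
    """Scale an interest intensity function down by a constant factor."""
    return [v * factor for v in intensity]
```

In use, `suppress` would be applied to the functions of every matched section except the one that is kept, so that only the kept section is likely to exceed the threshold in step 104.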
[0124]
Hereinafter, the process of combining the viewing behavior with the extraction section in step 1109 will be described.
[0125]
FIG. 18 is a flowchart of the viewing behavior combining process according to the fourth embodiment of the present invention, and FIG. 19 is a diagram for describing the viewing behavior combining process according to the fourth embodiment of the present invention.
[0126]
In FIG. 18, the same operations as those in the flowchart shown in FIG. 10 described above are denoted by the same step numbers, and description thereof will be omitted.
[0127]
Steps 401 to 402 are the same as in the first embodiment.
[0128]
Step 1203) It is determined whether or not the section extracted in Step 402 is a matching index. If the section is included in the one saved in step 1105 in FIG. 16, it is a coincidence index, and if not, it is not a coincidence index.
[0129]
Step 1204) The IN point and the OUT point used when synthesizing the viewing behavior with the video are converted. As shown in FIG. 19, the start of the video index section whose interest intensity function was not suppressed becomes the IN point for synthesis, and the point obtained by adding the section length of the viewing behavior to that IN point becomes the OUT point for synthesis.
[0130]
Steps 403 to 404 are the same as in the first embodiment.
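
The IN/OUT conversion of step 1204 is a simple time shift, sketched below. The function and parameter names are illustrative, and times are assumed to be in seconds on the common time code.

```python
from typing import Tuple


def remap_behavior(kept_section_start: float, behavior_start: float,
                   behavior_end: float) -> Tuple[float, float]:
    """Place a viewing behavior from a suppressed section onto the kept section.

    The IN point is the start of the kept (unsuppressed) video section and the
    OUT point is the IN point plus the length of the viewing behavior section.
    """
    in_point = kept_section_start
    out_point = in_point + (behavior_end - behavior_start)
    return in_point, out_point
```

For example, a 20-second singing section recorded at 65 to 85 seconds would be overlaid at 10 to 30 seconds if the kept BGM section starts at 10 seconds.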
[0131]
Through steps 1103 to 1105, 1107, and 1109 in FIG. 16 of the fourth embodiment, a combination (A) of indexes whose sections and feature amounts match within the video index is detected (step 1103), and the interest intensity functions of all of them except an arbitrary one index (B) are reduced as shown in FIG. 17 (step 1107). This increases the possibility that only (B) is extracted as an important scene in step 104. Then, in step 1109, the viewing behaviors performed in the sections of (A) can be overlaid on the section (B), as shown in FIG. 19.
[0132]
The flowcharts shown in FIGS. 3, 6, 8, 10, 12, 13, 14, 16, and 18 for the first to fourth embodiments can be implemented as a program and installed on a computer used as a video summarizing device. The program can also be distributed via a network.
[0133]
Further, the program can be stored in a hard disk device of a computer used as a video summarizing device, or on a portable storage medium such as a flexible disk or a CD-ROM, and then installed and run.
[0134]
It should be noted that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible within the scope of the claims.
[0135]
[Effects of the Invention]
As described above, according to the present invention, it is possible to generate a summary video based on a user's viewing behavior.
[0136]
According to the present invention, each important scene in the summary video is delimited at breaks in the video structure, such as video cut points and the start and end points of audio sections, so that a natural summary video can be generated. Here, a natural summary video is considered to be one in which the joints between important scenes coincide with breaks in the video. For example, when the video contains a scene showing a series of dance motions and that scene is extracted as an important scene, the IN point and the OUT point are not set in the middle of the scene. Instead, the start and end points of the scene, or the cut points before and after them, become the IN point and the OUT point of the important scene, so that a natural, cleanly separated summary video can be generated.
[0137]
Also, according to the present invention, the user can change the configuration of the summary video by designating the degree of interest intensity. In addition, a summary video can be generated by roughly specifying the time length of the summary video.
[0138]
Furthermore, since the threshold of the interest intensity can be increased or decreased, the time length of the summary video can be set with higher accuracy.
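The relation between the threshold and the summary length can be illustrated with a short Python sketch. It assumes the interest intensity function is sampled once per second into a list, and the function names are hypothetical; it is not the procedure of the embodiments, only one simple way to pick a threshold for a requested length.

```python
def sections_above(interest, threshold):
    """Return half-open (start, end) sections, in sample indexes, where the
    sampled interest intensity is at or above the threshold."""
    sections = []
    start = None
    for t, value in enumerate(interest):
        if value >= threshold and start is None:
            start = t
        elif value < threshold and start is not None:
            sections.append((start, t))
            start = None
    if start is not None:
        sections.append((start, len(interest)))
    return sections


def total_length(sections):
    return sum(end - start for start, end in sections)


def threshold_for_length(interest, target_length):
    """Try each distinct intensity value as a threshold and return the one
    whose summary length is closest to the target (highest wins on ties)."""
    candidates = sorted(set(interest), reverse=True)
    return min(
        candidates,
        key=lambda th: abs(total_length(sections_above(interest, th)) - target_length),
    )


interest = [0.1, 0.1, 0.6, 0.9, 0.9, 0.6, 0.1, 0.3, 0.7, 0.7, 0.2, 0.1]
threshold = threshold_for_length(interest, target_length=6)
summary = sections_above(interest, threshold)
print(threshold, summary)
```

Raising the threshold shortens the summary and lowering it lengthens the summary, which is why adjusting the threshold allows the time length to be approached more closely.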
[0139]
In addition, by synthesizing the viewing behavior performed during a scene with the important scene cut out from the video, in time synchronization with that scene, the original video and the viewing behavior can be viewed simultaneously in the summary video. For example, attaching the utterance "I went there last summer" to a scene introducing a certain sightseeing spot helps the viewer recall those memories when watching later, and also serves as a comment on the scene.
[0140]
In addition, sections that match between the video index and the viewing behavior index are detected, and the subtraction value of the interest intensity function at the start point and the end point of such a section is made larger than the normal value. As a result, when the user intentionally performs a viewing action over a section defined by the video structure, such as a cut or a music section, a summary video whose important scenes reflect that intention is generated. For example, when the user sings along with music played in the video, the singing section is likely to become a single scene of the summary video, with no extra scenes added before or after it.
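The effect of the larger subtraction value can be sketched as follows. The model here is an assumption made for illustration: each video index section receives one intensity value, the sections overlapping the viewing behavior receive the peak value, and the value drops by a fixed step at each change point crossed when moving away from the behavior. The names `segment_interest`, `step`, `boosted_step`, and `tol` are hypothetical.

```python
def segment_interest(segments, behavior, peak, step, boosted_step, tol=0.5):
    """segments: list of (start, end) video index sections in seconds.
    behavior: (start, end) of the viewing behavior.
    Returns one interest intensity value per video index section."""
    b_start, b_end = behavior
    hit = [i for i, (s, e) in enumerate(segments) if s < b_end and e > b_start]
    if not hit:
        return [0.0] * len(segments)
    first, last = hit[0], hit[-1]

    # The behavior "matches" when its start and end coincide with index boundaries.
    matches = (
        abs(segments[first][0] - b_start) <= tol
        and abs(segments[last][1] - b_end) <= tol
    )
    first_drop = boosted_step if matches else step

    values = []
    for i in range(len(segments)):
        if first <= i <= last:
            values.append(peak)
        else:
            crossings = first - i if i < first else i - last
            values.append(max(0.0, peak - first_drop - step * (crossings - 1)))
    return values


segments = [(0, 10), (10, 25), (25, 40), (40, 60)]
matched = segment_interest(segments, (10, 25), peak=1.0, step=0.25, boosted_step=0.75)
unmatched = segment_interest(segments, (12, 20), peak=1.0, step=0.25, boosted_step=0.75)
print(matched, unmatched)
```

With a threshold of 0.5 in this sketch, the matched case keeps only the section the user sang over, while the unmatched case also keeps the neighboring sections.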
[0141]
In addition, by detecting combinations whose sections and feature amounts match, a summary video free of redundant content can be generated even when the video contains redundant content such as repeated announcements or BGM. For example, when the same BGM is played in two places and the user sings in both sections, the two singing performances are overlapped and synthesized onto a single BGM section in the summary video. The viewing actions performed can thus be viewed together in one scene of the summary video.
[Brief description of the drawings]
FIG. 1 is a diagram for explaining the principle of the present invention.
FIG. 2 is a system configuration diagram according to an embodiment of the present invention.
FIG. 3 is a flowchart of an operation according to the first embodiment of the present invention.
FIG. 4 is an example of an interest intensity function according to the first embodiment of the present invention.
FIG. 5 is an example of an interest intensity function that is equal to or greater than a threshold according to the first embodiment of the present invention.
FIG. 6 is a flowchart of a viewing behavior indexing process according to the first embodiment of the present invention.
FIG. 7 is an example of obtaining a voice intensity section according to the first embodiment of the present invention.
FIG. 8 is a flowchart of an interest intensity function setting process according to the first embodiment of the present invention.
FIG. 9 is a diagram for explaining an interest intensity function setting process according to the first embodiment of the present invention.
FIG. 10 is a flowchart of a process of synthesizing a viewing behavior in an extraction section according to the first embodiment of the present invention.
FIG. 11 is a diagram for describing an interest intensity function setting process according to the second embodiment of the present invention.
FIG. 12 is a flowchart of an interest intensity setting process according to the second embodiment of the present invention.
FIG. 13 is a flowchart of an interest intensity function setting process according to the second embodiment of the present invention.
FIG. 14 is a flowchart of an operation according to the third embodiment of the present invention.
FIG. 15 is an example of an interest intensity function generated in the third embodiment of the present invention.
FIG. 16 is a flowchart of an operation according to the fourth embodiment of the present invention.
FIG. 17 is a diagram for describing an interest intensity function setting process according to the fourth embodiment of the present invention.
FIG. 18 is a flowchart of a viewing behavior combining process according to the fourth embodiment of the present invention.
FIG. 19 is a diagram illustrating a viewing behavior combining process according to the fourth embodiment of the present invention.
[Explanation of symbols]
501 monitor
502 microphone
503 camera
504 keyboard
505 push button
506 computer

Claims

1. A video summarization method for creating a summary from a video, comprising:
a change point extraction process of extracting a video index based on change points, including cut points, of the input video;
a viewing behavior extraction process of performing any or all of utterance section extraction, body movement section extraction, and performance section extraction on the viewing behavior performed by the user while viewing the video;
an interest intensity function setting process of setting an interest intensity function indicating the degree of interest of the user at each point in time of the video;
an interest intensity section extraction process of extracting a section whose interest intensity is equal to or greater than a predetermined threshold from the sections extracted by the viewing behavior extraction process; and
a reproduction process of continuously reproducing the sections extracted by the interest intensity section extraction process.

2. The video summarizing method according to claim 1, wherein
the viewing behavior extraction process includes a viewing behavior indexing process that measures the intensity, the length, or both, of the viewing behavior, and
the interest intensity function setting process sets the interest intensity function as a continuous value in accordance with the intensity and/or length of the viewing behavior,
and changes the interest intensity at change points of the video index.

3. The video summarizing method according to claim 1, further comprising a process of extracting a section in which the interest intensity is equal to or greater than a predetermined threshold and combining the viewing behavior with the section.

4. The video summarizing method according to claim 1, further comprising: a process of extracting an index whose section matches between the video index extracted in the change point extraction process and the viewing behavior index extracted in the viewing behavior extraction process; and
a process of increasing the amount of change in interest intensity at the start point and the end point of the matching index.

5. The video summarizing method according to claim 1, further comprising: a process of extracting, from among the video indexes extracted in the change point extraction process, indexes whose sections and feature amounts match each other; and
a process of extracting a section in which the interest intensity is equal to or greater than a predetermined threshold, and synthesizing with that section all viewing behaviors corresponding to the indexes whose sections and feature amounts match.

6. A video summarization program for creating a summary from a video, the program causing a computer to execute:
a change point extraction step of extracting a video index based on change points, including cut points, of the input video;
a viewing behavior extraction step of measuring the intensity, the length, or both, of the viewing behavior performed by the user while viewing the video, and extracting a viewing behavior index by performing any or all of utterance section extraction, body movement section extraction, and performance section extraction;
an interest intensity function setting step of setting an interest intensity function as a continuous value in accordance with the intensity and/or length of the viewing behavior, and changing the interest intensity at change points of the video index;
a section matching index extraction step of extracting an index whose section matches between the video index extracted in the change point extraction step and the viewing behavior index extracted in the viewing behavior extraction step;
a change amount increasing step of increasing the amount of change in interest intensity at the start point and the end point of the matching index;
a feature amount matching index extraction step of extracting, from among the video indexes extracted in the change point extraction step, indexes whose sections and feature amounts match;
a step of extracting a section in which the interest intensity is equal to or greater than a predetermined threshold, and synthesizing with that section all viewing behaviors corresponding to the indexes whose sections and feature amounts match; and
a playback step of synthesizing the viewing behavior with the extracted video sections and playing them back continuously.

7. A storage medium storing a video summarization program for creating a summary from a video, the stored program comprising:
a change point extraction step of extracting a video index based on change points, including cut points, of the input video;
a viewing behavior extraction step of measuring the intensity, the length, or both, of the viewing behavior performed by the user while viewing the video, and extracting a viewing behavior index by performing any or all of utterance section extraction, body movement section extraction, and performance section extraction;
an interest intensity function setting step of setting an interest intensity function as a continuous value in accordance with the intensity and/or length of the viewing behavior, and changing the interest intensity at change points of the video index;
a section matching index extraction step of extracting an index whose section matches between the video index extracted in the change point extraction step and the viewing behavior index extracted in the viewing behavior extraction step;
a change amount increasing step of increasing the amount of change in interest intensity at the start point and the end point of the matching index;
a feature amount matching index extraction step of extracting, from among the video indexes extracted in the change point extraction step, indexes whose sections and feature amounts match;
a step of extracting a section in which the interest intensity is equal to or greater than a predetermined threshold, and synthesizing with that section all viewing behaviors corresponding to the indexes whose sections and feature amounts match; and
a playback step of synthesizing the viewing behavior with the extracted video sections and playing them back continuously.