JP2014170980A

JP2014170980A - Information processing apparatus, information processing method, and information processing program

Info

Publication number: JP2014170980A
Application number: JP2011107104A
Authority: JP
Inventors: Masumi Ishikawa; 真澄石川
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-05-12
Filing date: 2011-05-12
Publication date: 2014-09-18
Also published as: WO2012153747A1

Abstract

PROBLEM TO BE SOLVED: To generate image content that retains the atmosphere of a source image. SOLUTION: An information processing apparatus comprises: extraction means which extracts at least two partial moving images or static images from a source moving image; determination means which determines, on the basis of a feature of the source moving image, a method for presenting the at least two partial moving images or static images extracted by the extraction means; and generation means which generates image content including the at least two partial moving images or static images, on the basis of the presentation method determined by the determination means.

Description

The present invention relates to a technique for extracting parts of a moving image and generating new video content.

Patent Document 1 describes an example of a technique for generating a new video from images extracted from a video. In Patent Document 1, key frames are selected based on the content across the frames constituting the input video (visual content such as camera motion, object motion, and human faces, and acoustic content such as audio events), and a new video is generated that displays the selected key frames for arbitrary presentation times.

Patent Document 2 describes an example of a method for generating a new video from key frames extracted from an input video. In Patent Document 2, key frames are selected from the input video by face detection or in response to user operations, and a new video is generated that displays each key frame for a presentation time determined from the size, smile, age, and orientation of the face in that key frame and from operation information.

Patent Document 1: Japanese Translation of PCT Application Publication No. 2009-537047
Patent Document 2: JP 2010-213136 A

However, in the above prior art, the generated video is presented without considering the relationships between key frames in the source video. As a result, a viewer of the new video may be unable to understand that key frames show changes or motion transitions of a common subject in the source video, or subjects that were shot with a common intention.

An object of the present invention is to provide a technique that solves the above problem.

To achieve the above object, an apparatus according to the present invention comprises:
extraction means for extracting at least two partial moving images or still images from a source moving image;
determination means for determining, based on features of the source moving image, a method of presenting the at least two partial moving images or still images extracted by the extraction means; and
generation means for generating video content including the at least two partial moving images or still images based on the presentation method determined by the determination means.

To achieve the above object, a method according to the present invention comprises:
an extraction step of extracting at least two partial moving images or still images from a source moving image;
a determination step of determining, based on features of the source moving image, a method of presenting the at least two partial moving images or still images extracted in the extraction step; and
a generation step of generating video content including the at least two partial moving images or still images based on the presentation method determined in the determination step.

To achieve the above object, a program according to the present invention causes a computer to execute:
an extraction step of extracting at least two partial moving images or still images from a source moving image;
a determination step of determining, based on features of the source moving image, a method of presenting the at least two partial moving images or still images extracted in the extraction step; and
a generation step of generating video content including the at least two partial moving images or still images based on the presentation method determined in the determination step.

According to the present invention, video content that retains the atmosphere of the source video can be generated.

Block diagram showing the configuration of the information processing apparatus according to the first embodiment of the present invention.
Block diagram showing the configuration of the information processing apparatus according to the second embodiment of the present invention.
Diagram explaining slide show generation by the information processing apparatus according to the second embodiment of the present invention.
Flowchart showing the flow of processing of the information processing apparatus according to the second embodiment of the present invention.
Diagram explaining video content generation by the information processing apparatus according to the second embodiment of the present invention.
Diagram explaining video content generation by the information processing apparatus according to the second embodiment of the present invention.
Diagram explaining video content generation by the information processing apparatus according to the third embodiment of the present invention.
Diagram explaining video content generation by the information processing apparatus according to the third embodiment of the present invention.
Diagram explaining video content generation by the information processing apparatus according to the third embodiment of the present invention.

Embodiments of the present invention are described in detail below by way of example with reference to the drawings. However, the components described in the following embodiments are merely examples and are not intended to limit the technical scope of the present invention to them.

[First Embodiment]
An information processing apparatus 100 according to a first embodiment of the present invention will be described with reference to FIG. 1. The information processing apparatus 100 is an apparatus that edits a moving image to generate video content.

As shown in FIG. 1, the information processing apparatus 100 includes an extraction unit 101, a presentation method determination unit 102, and a video content generation unit 103.

The extraction unit 101 extracts at least two partial moving images or still images from a source moving image. The presentation method determination unit 102 determines, based on features of the source moving image, how to present the at least two partial moving images or still images extracted by the extraction unit 101. The video content generation unit 103 then generates video content including the at least two partial moving images or still images based on the presentation method determined by the presentation method determination unit 102. Examples of the video content include a slide show.
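
The three units described above can be pictured as a small pipeline. The following is a minimal sketch only, not the apparatus's actual implementation: it assumes the source moving image has been reduced to a list of per-frame motion scores in [0, 1], and the names `extract`, `decide_presentation`, `generate`, `make_content`, and `Slide`, the even-spacing extraction, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Slide:
    frame_index: int   # which source frame is shown
    duration_s: float  # how long it stays on screen


def extract(source: Sequence[float], count: int = 2) -> List[int]:
    """Extraction unit (sketch): pick `count` evenly spaced frame indices."""
    if count < 2 or len(source) < count:
        raise ValueError("need at least two frames to extract")
    step = (len(source) - 1) / (count - 1)
    return [round(i * step) for i in range(count)]


def decide_presentation(source: Sequence[float], indices: List[int]) -> List[float]:
    """Determination unit (sketch): derive durations from a source-wide feature.

    Assumption: a high mean motion score means a lively source, so each
    slide is shown briefly; otherwise each slide is shown longer.
    """
    mean_motion = sum(source) / len(source)
    duration = 1.0 if mean_motion > 0.5 else 3.0
    return [duration for _ in indices]


def generate(indices: List[int], durations: List[float]) -> List[Slide]:
    """Generation unit (sketch): assemble slide-show-like video content."""
    return [Slide(i, d) for i, d in zip(indices, durations)]


def make_content(source: Sequence[float], count: int = 2) -> List[Slide]:
    """Run the three units in order: extract, decide, generate."""
    indices = extract(source, count)
    return generate(indices, decide_presentation(source, indices))
```

The point of the sketch is the data flow: the determination step looks at the whole source, not only at the extracted frames, which is what lets the result keep something of the source's character.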

According to the present embodiment, video content that retains the atmosphere of the source video can be generated.

[Second Embodiment]
(Prerequisite technology)
Video posting sites use a function that presents a list of still images and partial moving images extracted from a video (hereinafter referred to as key frames) so that viewers can quickly grasp the content of a video and efficiently select videos that interest them.

Multiple key frames are usually needed for a viewer to understand an entire video properly. However, when many key frames are presented on a display at once, each key frame becomes small and its content may be hard to make out.

A presentation method that switches between key frames and presents them in order is therefore considered effective (hereinafter, a video in which key frames are switched and presented in order is referred to as video content). Understanding the relationship between key frames presented consecutively in the video content can be valuable for grasping the content of the input video. For example, recognizing that consecutive key frames in the video content represent changes or motion transitions of a common subject lets the viewer follow the flow of the motion or change. Likewise, recognizing that the subjects in consecutive key frames were shot with a common intention in the input video (for example, with a similar degree of interest) lets the viewer understand how the subjects relate in importance.

For a viewer to understand the relationship between consecutive key frames, the presentation method, such as the presentation time of each key frame and the effects inserted between key frames, matters a great deal. For example, if consecutive key frames are presented in the same way, the viewer may mistakenly assume they are related even when they are not. Conversely, if consecutive key frames are presented in completely different ways, the viewer may mistakenly assume they are unrelated. To have the viewer correctly understand the relationships between key frames, it is therefore effective to control the presentation rules according to how the content of the key frames is related.

[Configuration]
The configuration of an information processing apparatus 200 according to the second embodiment of the present invention will be described with reference to FIG. 2. FIG. 2 is a diagram for explaining the schematic configuration of the information processing apparatus 200 according to the present embodiment.

The information processing apparatus 200 includes a key frame extraction unit 201, a presentation method determination unit 202, a video content generation unit 203, and a video input unit 204.

The key frame extraction unit 201 extracts at least two partial moving images or still images as key frames from a source moving image 210 serving as the input video. The presentation method determination unit 202 determines, based on features of the source moving image, how to present the at least two partial moving images or still images extracted by the key frame extraction unit 201. Based on the presentation method determined by the presentation method determination unit 202, the video content generation unit 203 generates video content 240, a new video that presents the at least two partial moving images or still images in succession.

The presentation method determination unit 202 includes a relevance determination unit 221 and a presentation method selection unit 222. The relevance determination unit 221 determines, based on the source moving image, whether the objects included in the at least two partial moving images or still images have something in common. The relevance determination unit 221 also determines, based on the source moving image, whether the objects included in the at least two partial moving images or still images are the same. When the objects have something in common, the presentation method selection unit 222 selects a presentation method different from the one used when they do not. The presentation method determination unit 202 determines the presentation time of each key frame in the video content 240.

The video input unit 204 receives the source moving image 210 from a video camera or the like and passes it to the key frame extraction unit 201 and the presentation method determination unit 202. The key frame extraction unit 201 sends the video content generation unit 203 not only the key frames extracted from the source moving image 210 but also key frame information related to those key frames. The key frame information consists of a key frame ID identifying the key frame, its presentation order in the video content, and the pixel information of the key frame.

In response to a request from the relevance determination unit 221, the video input unit 204 inputs information on the input video (video information) to the relevance determination unit 221. The video information consists of a key frame ID and the pixel information and acoustic information of the section corresponding to that key frame. The section corresponding to a key frame is the unit section of the input video to which the key frame belongs, or a unit section that contains the same subject as the key frame. Any one or a combination of the following four kinds of section may be used as the unit section: (1) sections delimited at fixed time intervals; (2) sections delimited based on control signals of the shooting equipment, such as camera switching points; (3) sections delimited based on features extracted from the video, such as image change points or acoustic change points between frames; and (4) sections delimited manually at points where the shooting content changes, such as the location, subject, or time of day.

At least one section corresponds to each key frame, and multiple key frames may be associated with a single section. The image information of a section is the image information of the frames belonging to that section. The acoustic information of a section is the sound information synchronized with that section. The section information may also include meta information describing the subject appearing in the section, the shooting location, and the shooting time, as well as sensor information such as GPS data.
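
As an illustration of the simplest kind of unit section, fixed time intervals, the sketch below maps key frames to the section that contains them. It is a hedged example under stated assumptions: key frames are identified by string IDs and located by a timestamp in seconds, and the function names `section_index` and `group_by_section` are hypothetical, not taken from the source.

```python
from collections import defaultdict
from typing import Dict, List


def section_index(timestamp_s: float, interval_s: float) -> int:
    """Index of the fixed-length unit section that contains a timestamp."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return int(timestamp_s // interval_s)


def group_by_section(
    keyframe_times: Dict[str, float], interval_s: float
) -> Dict[int, List[str]]:
    """Associate each key frame ID with its section.

    Several key frames may land in the same section, and every key frame
    gets exactly one section under this fixed-interval scheme.
    """
    sections: Dict[int, List[str]] = defaultdict(list)
    for key_frame_id, t in keyframe_times.items():
        sections[section_index(t, interval_s)].append(key_frame_id)
    return dict(sections)
```

With 5-second sections, key frames at 1.0 s and 4.5 s share section 0, while a key frame at 12.0 s falls in section 2. Sections derived from camera control signals or change points would replace `section_index` with a lookup into a list of boundaries.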

Based on the key frame information input from the key frame extraction unit 201, the relevance determination unit 221 acquires the video information corresponding to each key frame from the video input unit 204 and determines the relevance between key frames. The relevance determination unit 221 inputs key frame relevance information to the presentation method selection unit 222. The key frame relevance information consists of a key frame ID and a relevance flag. In addition to these, the pixel information of the key frame may also be input as key frame information. The relevance flag is data indicating which of the predefined relevance types exist between the current key frame and the key frame presented after it, or data indicating that none of the relevance types exists (that is, there is no relevance). For example, the flag is set to 1 for every relevance type that exists between a key frame and the following key frame and to 0 for every relevance type that does not. Alternatively, an arbitrary numerical value that has meaning for each relevance type may be set.
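
One way to picture the key frame relevance information is as a record holding a key frame ID and one 0/1 flag per relevance type. The sketch below is illustrative only: the record type `KeyFrameRelevance`, the function `build_relevance`, and the idea of passing each relevance type as a predicate over two key frame IDs are assumptions made for this example, and the sketch emits records only for key frames that have a following key frame.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class KeyFrameRelevance:
    key_frame_id: str
    flags: Dict[str, int]  # relevance type name -> 1 (exists) or 0 (absent)


def build_relevance(
    ordered_ids: List[str],
    checks: Dict[str, Callable[[str, str], bool]],
) -> List[KeyFrameRelevance]:
    """Flag each relevance type between a key frame and the one shown after it.

    `ordered_ids` is the presentation order; `checks` maps a hypothetical
    relevance type name to a predicate over (current, following) IDs.
    """
    records = []
    for current, following in zip(ordered_ids, ordered_ids[1:]):
        flags = {name: int(check(current, following)) for name, check in checks.items()}
        records.append(KeyFrameRelevance(current, flags))
    return records
```

A record whose flags are all 0 corresponds to the "no relevance" case described above.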

The relevance determination unit 221 may determine, based on the source moving image, whether the objects in the key frames were shot in the same way. The relevance determination unit 221 may also determine, based on acoustic features of the source moving image, whether the objects included in the key frames have something in common.

A method of determining relevance that focuses on the identity of the subject is described below.

(Relevance 1: Subject identity)
The relevance determination unit 221 can determine whether key frames show the same subject from, for example, the continuity of shooting in the source moving image from which the key frames were extracted. Relevance determined in this way is referred to as relevance 1.

"The subject is the same" means that a pair of consecutive key frames in the video content shows a common subject. This covers both the case where the subject in the key frame pair shows no apparent change at all and the case where the key frames capture different moments in a series of changes or motions of the subject. For example, in a video of a building whose color changes over time, the key frames before and after the color change appear to show different subjects when judged from the key frames alone, but referring to the source moving image makes it possible to determine that the subject is the same. The same applies to a pair of key frames taken before and after hatching from a source moving image of an insect hatching: judged from the key frames alone they appear to show different subjects, but referring to the source moving image makes it possible to determine that the subject is the same. Similarly, when key frames before and after a change in timbre are extracted from a video of a musical instrument whose timbre changes over time, judging only from the audio at the extraction points may suggest that the key frames capture different subjects. Here too, referring to the entire source moving image makes it possible to determine that the subject is the same.

In other words, referring to the source moving image makes it clearer that a group of key frames was extracted from footage in which the same subject was shot continuously. That is, the editing points of the source moving image can be found, and the key frames lying between them can be presumed to show the same subject.
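
The editing-point idea can be sketched as follows: two consecutive key frames are presumed to show the same subject when no editing point lies between them. This is a simplified illustration, not the method of the source: it assumes key frames and editing points are given as timestamps in seconds, and the function name `relevance1_flags` is hypothetical. How the editing points are found is outside the sketch.

```python
from bisect import bisect_right
from typing import List, Sequence


def relevance1_flags(keyframe_times: Sequence[float], cut_times: Sequence[float]) -> List[int]:
    """Return one flag per consecutive key frame pair, in presentation order.

    The flag is 1 when both key frames fall between the same two editing
    points (presumed same subject) and 0 when a cut separates them.
    A cut located exactly at a key frame's timestamp is treated as
    preceding that key frame.
    """
    cuts = sorted(cut_times)
    flags = []
    for earlier, later in zip(keyframe_times, keyframe_times[1:]):
        same_shot = bisect_right(cuts, earlier) == bisect_right(cuts, later)
        flags.append(1 if same_shot else 0)
    return flags
```

For key frames at 1.0, 3.0, 7.0, and 9.0 seconds with a single editing point at 5.0 seconds, the first and last pairs stay within one shot and the middle pair straddles the cut.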

The relevance flag for relevance 1 is set to 1 when a key frame and the following key frame show the same subject and to 0 when they do not. Subject identity can be determined, for example, by the following methods. A pair of consecutive key frames in the video content may be determined to show the same subject when the two key frames are associated with the same section and the subject regions (image regions of the subject) detected in them are the same. Alternatively, a subject region is detected in each key frame of the pair, and in the section corresponding to each key frame the subject region is detected and tracked. The subject region detected in one section and the subject regions obtained while tracking it are compared with those of the other section, and the subjects may be determined to be the same when image features such as color and shape are similar. Alternatively, for a pair of consecutive key frames in the video content, the image region of the subject and the acoustic information emitted by the subject are extracted from the section corresponding to each key frame.

When both the image features and the acoustic information are similar between the subject region detected in one section and the subject region detected in the other, the subjects in the key frame pair may be determined to be the same. Methods of detecting a subject region in a key frame fall into two cases: detecting a specific target registered in advance, and detecting a general target that has not been registered. To detect a specific target, the image data of each registered target may be used as a template; the key frames belonging to the section are scanned with the template converted to various resolutions, and a region where the differences from the template's pixel values at the same positions are small is detected as the corresponding subject region.
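
The template-scanning step for a registered target can be illustrated with a brute-force search that minimizes the sum of absolute pixel differences. This is a minimal sketch under stated assumptions: images are grayscale and stored as nested lists, only a single template resolution is scanned (the multi-resolution scan described above would repeat the search with resized templates), and the name `match_template` is hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels, indexed as image[row][column]


def match_template(image: Image, template: Image) -> Optional[Tuple[int, int, int]]:
    """Slide the template over the image and keep the best-matching position.

    Returns (difference, x, y) for the top-left corner with the smallest sum
    of absolute differences, or None if the template does not fit.
    """
    t_h, t_w = len(template), len(template[0])
    i_h, i_w = len(image), len(image[0])
    best: Optional[Tuple[int, int, int]] = None
    for y in range(i_h - t_h + 1):
        for x in range(i_w - t_w + 1):
            diff = sum(
                abs(image[y + dy][x + dx] - template[dy][dx])
                for dy in range(t_h)
                for dx in range(t_w)
            )
            if best is None or diff < best[0]:
                best = (diff, x, y)
    return best
```

In practice a threshold on the returned difference would decide whether the region counts as a detection at all; that threshold is left to the caller here.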

Alternatively, image features expressing color, texture, and shape may be extracted from each partial region of the key frame, and a partial region whose image features are similar to those of a registered target may be taken as the corresponding subject region. When the specific target is a person, there are methods that use information obtained from the whole face. One example is to store images of various faces as templates and determine that a face is present in the input image when the difference between the key frame and a template is at or below a certain threshold. Alternatively, a model combining color information such as skin color with edge direction and density may be stored in advance, and a region similar to the model may be detected as the subject region. Any one or a combination of the following methods may also be used: a method that detects faces using a template built on the facts that the outline of the face (head) is elliptical and that the eyes and mouth are elongated in shape; a method that exploits the luminance distribution in which the cheeks and forehead are bright and the eyes and mouth are dark; and a method that uses the symmetry of the face and the skin-colored region and its position.

A method that statistically learns feature distributions obtained from a large number of face and non-face training samples and determines whether the features obtained from the input image belong to the face distribution or the non-face distribution (a neural network, a support vector machine, the AdaBoost method, or the like) may also be used. To detect a general target, Normalized Cut, Saliency Map, or Depth of Field (DoF), for example, may be used. Normalized Cut is a technique for dividing an image into multiple regions. It is described in detail in Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8, August 2000. Of the regions produced by Normalized Cut, the region located at the center of the key frame may be detected as the subject region. Alternatively, a region for which the Saliency Map yields high importance may be detected as the subject region. Saliency Map is a method of computing object regions in an image from visual attention. It is described in detail in L. Itti, C. Koch and E. Niebur, "A Model of Saliency-based Visual Attention for Rapid Scene Analysis", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 20, No. 11, pp. 1254-1259, 1998.

DoF is a technique based on the property that the edges of an object within the depth of field are not blurred, whereas edges outside the depth of field are blurred. Details are disclosed in Du-Ming Tsai and Hu-Jong Wang, "Segmenting focused objects in complex visual images", Pattern Recognition Letters, Vol. 19, pp. 929-940, 1998. The amount of blur may be computed from the thickness of the edges, edges with little blur may be joined, and the in-focus region may be detected as the target region.

The target region may also be detected from a key frame based on its position in the still image, its visibility (an evaluation value indicating how well the subject is captured, based on lighting conditions, orientation, angle, position on the screen, occlusion by other objects, blur, facial expression in the case of a person, and the like), or its frequency of appearance across multiple images. Multiple detected subject regions may also be combined into one. A subject region may be detected in the section corresponding to a key frame by, for example, using the image information of the subject region detected in the key frame as a template and detecting the subject region in any frame belonging to that section.

Tracking of the subject region detected within a section can be realized, for example, as follows. The frame in which the subject region was detected is taken as the start frame, and subject region detection is performed on temporally adjacent frames. The template used for detection is the image feature of the already detected target region, and the template is scanned over a region of a prescribed range centered on the position at which the target region was already detected. The similarity between subject regions may be calculated by extracting an image feature from each subject region and using a measure that yields a higher value as the difference between the image features becomes smaller. The image feature can be calculated from image information such as color, edges, and texture detected in the subject region. Alternatively, local feature points such as SIFT may be detected in the image region of each subject and matched between image regions, using a measure that yields a higher value as the number of matched feature points increases or as the positional relationship of the matched feature points becomes more similar between the images.

The similarity of the acoustic information emitted by a subject may be calculated, for example, by extracting the acoustic energy in multiple frequency bands from the section in which the subject region was detected and using a measure that yields a higher value as the difference in acoustic energy becomes smaller. Using the information of the section corresponding to a key frame in this way makes it possible to determine the identity of a subject more robustly against changes in the subject's appearance and in the background between key frames than when only the key frame information is used.
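The difference-based similarity measure described above can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors are plain lists of numbers (for example, color histogram bins or per-band acoustic energies), the L1 distance and the mapping `1 / (1 + d)` are choices made here and are not specified in the description, and the names `feature_similarity` and `best_match` are hypothetical.

```python
def feature_similarity(feat_a, feat_b):
    """Return a similarity in (0, 1]; a smaller feature difference gives a higher value.

    Assumption: L1 distance mapped through 1 / (1 + d). The description only
    requires that the value rise as the difference shrinks.
    """
    if len(feat_a) != len(feat_b):
        raise ValueError("feature vectors must have the same length")
    diff = sum(abs(x - y) for x, y in zip(feat_a, feat_b))
    return 1.0 / (1.0 + diff)


def best_match(template_feat, candidates):
    """Pick the candidate region most similar to the template.

    candidates: list of (position, feature_vector) pairs taken from the
    search window around the previously detected position.
    """
    return max(candidates, key=lambda c: feature_similarity(template_feat, c[1]))


if __name__ == "__main__":
    window = [((0, 0), [0.0, 1.0]), ((5, 5), [1.0, 0.0])]
    print(best_match([1.0, 0.0], window)[0])
```

The same function could be applied to per-band acoustic energies, since the description uses the same "smaller difference, higher value" principle for both image and acoustic features.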

The presentation method selection unit 222 determines the effect or jingle used when switching key frames. When consecutive key frames are related to each other, it determines an effect or jingle different from the one used when they are unrelated. The presentation method selection unit 222 also determines the background music for the key frames in the video content.

Specifically, the presentation method selection unit 222 determines how the key frames are presented, based on the key frame relevance information input from the relevance determination unit 221 and presentation rules registered in advance. It then inputs information indicating the key frame presentation method (presentation method information) to the video content generation unit 203. The presentation method information is data indicating how each key frame is presented, and includes a key frame ID and a presentation time. It may additionally hold an effect, BGM, an audio jingle, and a video jingle.
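As a rough illustration of the presentation method information, the record below holds the two required items (key frame ID and presentation time) and the four optional ones. The class name, field names, and the choice of seconds as the time unit are assumptions made for this sketch and do not come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PresentationInfo:
    """Hypothetical record for one key frame's presentation method."""
    key_frame_id: int                   # required by the description
    presentation_time: float            # required; seconds assumed here
    effect: Optional[str] = None        # optional, e.g. "dissolve"
    bgm: Optional[str] = None           # optional background music name
    audio_jingle: Optional[str] = None  # optional
    video_jingle: Optional[str] = None  # optional


if __name__ == "__main__":
    info = PresentationInfo(key_frame_id=501, presentation_time=4.0, effect="dissolve")
    print(info)
```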

A presentation rule defines how key frames are presented according to the relevance type. Each presentation rule holds a parameter that defines the presentation time of each key frame in a consecutive key frame pair. In addition to the presentation time, it may hold control parameters for effects, BGM, and jingles (short video clips, music, or sound effects) inserted between key frames. A presentation method may also be defined for the case in which no relevance type applies to a consecutive key frame pair. Examples of presentation rules include the following.

(1) Rules for presentation time
When at least two partial moving images or still images are related to each other, the presentation method determination unit 202 determines the presentation time of one based on the presentation time of the other. Specifically, when the objects contained in the at least two partial moving images or still images are the same, the presentation method determination unit 202 makes the presentation time of the partial moving image or still image inserted later shorter than that of the one inserted earlier. When the at least two partial moving images or still images are not related to each other, the presentation method determination unit 202 determines their presentation times independently.

In other words, the presentation times of a key frame pair are determined based on whether the subjects contained in the consecutive key frames are the same, or whether the photographer's interest in the subjects is the same. For example, when the subjects contained in a consecutive key frame pair are the same, or the photographer's interest in them is the same, the presentation time of the key frame presented first may be set to an initial value Ts, and the presentation times of subsequent key frames may be determined with Ts as the reference. Alternatively, within a group of key frames containing the same subject or subjects shot with the same interest, the presentation time of a key frame with high visibility may be set to Tp, and the presentation times of subsequent key frames may be determined with Tp as the reference. Within such a group, the presentation time of the key frame following one whose presentation time has fallen to Tq or less may also be reset to the initial value Ts, with the presentation times of subsequent key frames determined with Ts as the reference. The presentation time of the key frame presented last in such a group may also be set to the initial value Ts. The values of Ts and Tp may be calculated from the number of key frames to be presented, given a preset duration for the entire video content. When the subjects contained in a consecutive key frame pair are not the same, or were not shot with the same interest, the presentation time of the subsequent key frame is determined independently of that of the preceding key frame; for example, it may be set to the initial value Ts or to a random value within a prescribed range.

As an example, consider the case, shown in FIG. 3, in which five key frames 301 to 305 are extracted from a moving image shot while circling around person A. Because key frames 301 to 305 contain the same subject, the presentation time of each subsequent key frame is calculated, as described above, by multiplying the presentation time Ts of the first key frame by a parameter a. With the presentation time of the first key frame 301 set to the initial value Ts, the presentation time Ti of the subsequent key frame 302 is expressed by equation (1).

Further, when the visibility evaluation value of key frame 303, in which the person faces the front, is equal to or greater than a threshold, the presentation time of key frame 303 is Tp, and the presentation time Tj of the subsequent key frames 304 and 305 is expressed by equation (1).

When the parameter a is set between 0 and 1, among the key frames containing person A, key frame 301, which is presented first, and key frame 303, in which person A appears clearly, are presented for a long time, and the presentation time becomes gradually shorter with distance from key frames 301 and 303.
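The equations referred to in this example are not reproduced in this text, so the sketch below should be read as one plausible reading and not as the patent's formula. It assumes that each key frame's presentation time is the previous one multiplied by a, that the first key frame gets Ts, and that any later key frame whose visibility score reaches the threshold restarts the sequence at Tp. The function name, the visibility scores, and the numeric values are hypothetical.

```python
def presentation_times(visibility, ts, tp, a, threshold):
    """Assign a presentation time to each key frame of one related group.

    Assumed recurrence (not taken from the source equations):
      first frame            -> ts
      visibility >= threshold -> tp (sequence restarts here)
      otherwise               -> a * previous presentation time
    """
    times = []
    for index, score in enumerate(visibility):
        if index == 0:
            times.append(ts)
        elif score >= threshold:
            times.append(tp)
        else:
            times.append(a * times[-1])
    return times


if __name__ == "__main__":
    # Five key frames standing in for 301 to 305; the third has high visibility.
    scores = [0.2, 0.3, 0.9, 0.4, 0.1]
    print(presentation_times(scores, ts=4.0, tp=4.0, a=0.5, threshold=0.8))
```

With a between 0 and 1, this reproduces the behavior stated in the text: the first key frame and the high-visibility key frame are shown longest, and the frames after each of them are shown for progressively shorter times.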

As a result, the user can grasp the content of the key frame showing the moment the target first appears and of the key frames in which it appears clearly, and can understand that the other images show essentially the same content. In addition, because the presentation time varies between consecutive images even when they contain the same target, video content with a sense of tempo that does not bore the viewer can be generated.

(2) Rules for effects, BGM, and jingles
The presentation method selection unit 222 determines the effect, BGM, and jingle to be inserted between a key frame pair, based on whether the subjects contained in the consecutive key frames are the same or whether the photographer's interest in the subjects is the same. For example, when they are the same, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame transition. When they are not the same, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the transition. Similarly, when the subjects contained in a consecutive key frame pair are the same or were shot with the same interest, the same BGM is played throughout the presentation of the pair; when they are not, the BGM is stopped or switched to different BGM at the key frame transition. A video or audio jingle may also be inserted between key frames that contain neither the same subject nor subjects shot with the same interest. As a result, key frames containing the same subject or subjects shot with the same interest are connected smoothly, without visual or acoustic change.
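The rule above amounts to a two-way choice at each key frame transition. A minimal sketch follows, assuming a single boolean `related` that already summarizes "same subject or same photographer interest"; the function name, the dictionary layout, and the specific effect names picked from each registered category are illustrative assumptions.

```python
def choose_transition(related, current_bgm, next_bgm):
    """Select effect, BGM, and jingle for the transition between two key frames.

    related: True when the pair contains the same subject or subjects shot
             with the same interest (the relevance test itself is not shown).
    """
    if related:
        # Low visual change, keep the same BGM, no jingle.
        return {"effect": "dissolve", "bgm": current_bgm, "jingle": False}
    # High visual change, switch BGM (None would mean stopping it), insert a jingle.
    return {"effect": "wipe", "bgm": next_bgm, "jingle": True}


if __name__ == "__main__":
    print(choose_transition(True, "theme_a", "theme_b"))
    print(choose_transition(False, "theme_a", "theme_b"))
```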

The viewer can therefore easily understand that the key frames contain the same subject or subjects of the same importance, and that they are images taken partway through a series of changes or actions, or images containing subjects shot with the same intention. Conversely, when the subjects in the key frames are not the same, or the key frames do not contain subjects shot with the same intention, the image and sound change markedly. The viewer therefore notices that the content of the key frames has changed substantially and can concentrate on understanding the new video.

The video content generation unit 203 generates and outputs new video content based on the presentation method information selected by the presentation method selection unit 222 and the key frame information input from the key frame extraction unit 201.

(Operation)
Next, the operation of this embodiment is described in detail with reference to the flowchart of FIG. 4. As an example, consider the case in which key frames 501 to 513 shown in FIG. 5 are extracted to generate video content. This video content conveys an event in which flowers and people were filmed in a greenhouse inside a building. The target regions detected in each key frame by the relevance determination unit 221 are shown as rectangles in FIG. 5.

As presentation rules, the presentation method is controlled using rules based on the size relationship or partial relationship for key frame pairs whose target regions are the same, and rules based on homogeneity for key frame pairs whose target regions are not the same. The rules based on the size relationship, partial relationship, and homogeneity are described in detail in the third and subsequent embodiments.

First, in step S401, the video input unit 204 inputs the source moving image. In step S402, the input source moving image is passed to the key frame extraction unit 201, which extracts key frames.

In step S403, the key frame information is passed from the key frame extraction unit 201 to the presentation method determination unit 202, and the video information is also passed from the video input unit 204 to the presentation method determination unit 202. The relevance determination unit 221 refers to the source moving image from which key frames 501 to 513 were extracted and determines the identity of the subjects and the commonality of the shooting methods.

Further, in step S403, the relevance determination unit 221 detects target regions in key frames 501 to 513. It is assumed that buildings, flowers, and people have been registered in advance as targets in the relevance determination unit 221 and that a model of each has been learned. In each of key frames 501 to 513, the area enclosed by a solid rectangle is detected as the target region of a building.

In step S405, image features are extracted from the pixel information of target region 0 and target region 1, and identity, size relationship, partial relationship, and homogeneity are determined based on the similarity between the regions. Because target regions 0 and 1 are both detected as the building type, they are determined to be homogeneous. The dashed rectangular area on key frame 501 is detected as the region common to target region 1 and target region 0, which shows that target regions 1 and 0 are in a size relationship. Because target region 0 contains no area other than the common region, it is determined that there is no partial relationship. The relevance flags of key frame 501 with respect to key frame 502 are therefore 1, -1, 0, and 1, in the order identity, size relationship, partial relationship, homogeneity.

The presentation method selection unit 222 selects the presentation method based on the image ID and relevance flags provided as image relevance information. For example, because the target regions of key frames 501 and 502 are the same, a rule based on the size relationship or partial relationship is applied. The presentation time of key frame 501, the starting image, is the initial value Ts; because key frames 501 and 502 are in a small-to-large relationship, the presentation time of key frame 502 is set to a × Ts. In addition, because key frames 501 and 502 have a size relationship, a dissolve, which produces little visual change, is inserted as the transition effect between them (step S407).

The video content generation unit 203 generates video content from key frames 501 and 502 using the determined presentation times and effect (step S409).

FIG. 6 shows the target region type 601 detected in each key frame, the relevance flags 602 for each relevance type, and the presentation time length 603 and effect 604 determined by the presentation method determination unit 202.

According to this embodiment, key frames extracted from an input video can be used to generate a new video in which the semantic relationships that the subjects in the key frames had in the input video are easy to understand.

[Third Embodiment]
Instead of, or in addition to, relevance 1 disclosed in the second embodiment, the presentation method may be changed according to a change in any one of the following types of relevance.

(Relevance 2: Moving image shooting method)
When a group of key frames related by the method the photographer used to shoot the subject in the source moving image is extracted, those key frames are presented by a method that reflects that relationship. For example, when multiple consecutive key frames are all extracted from a portion of the moving image shot by follow shooting (a technique in which the camera is moved to follow the movement of the subject), those key frames are determined to be related.

Similarly, when multiple consecutive key frames are all extracted from a portion shot by zooming, or all from a portion shot with the camera held still for a certain time or longer, those key frames are judged to have a certain relevance or commonality. The relevance flag for relevance 2 is set to 1 when the shooting methods are related and to 0 when they are not.

A subject region is detected in each key frame of a consecutive pair, and based on the information of that subject region, subject regions are detected in the section corresponding to each key frame. The motion vectors of the subject region and the background region are then analyzed to determine whether the subject was shot by a deliberate method such as following, zooming, or holding the camera still.

The shooting method for the subject region within a section can be determined, for example, as follows. Whether the subject was shot by follow shooting can be determined by the follow-target determination method disclosed in Japanese Patent No. 4593314. Whether the subject was shot by zooming in or with a still camera can be determined by the camera motion determination method disclosed in Japanese Unexamined Patent Application Publication No. 2007-19814.

[Rules according to the commonality of shooting methods]
(2-1) Rules for presentation time
The presentation times of a key frame pair are determined based on whether the consecutive key frames were shot by the same method in the source moving image. For example, when key frames shot by the same method appear consecutively, the presentation time is gradually shortened. That is, within a group of key frames shot by the same method, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of subsequent key frames are determined with Ts as the reference. Alternatively, the presentation time of a key frame with high visibility in the group may be set to Tp, with the presentation times of subsequent key frames determined with Tp as the reference. The presentation time of the key frame following one whose presentation time has fallen to Tq or less may also be reset to the initial value Ts, with the presentation times of subsequent key frames determined with Ts as the reference. The presentation time of the image presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, given a preset presentation time for the entire video content. When the key frames of a consecutive pair were shot by different methods, the presentation time of the subsequent key frame is determined independently of that of the preceding one; for example, it may be set to the initial value Ts or to a random value within a prescribed range.

(2-2) Rules for effects, BGM, and jingles
The effect, BGM, and jingle to be inserted between a key frame pair are determined based on whether the consecutive key frames were shot by the same method. For example, when the key frames of a consecutive pair were shot by the same method, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame transition. When they were shot by different methods, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the transition. Similarly, when the key frames of a consecutive pair were shot by the same method, the same BGM is played throughout the presentation of the pair; when they were shot by different methods, the BGM is stopped or switched to different BGM at the key frame transition. A jingle may also be inserted between key frames shot by different methods. As a result, when the key frames of a consecutive pair were shot by the same method, they are connected smoothly, without visual or acoustic change, and the viewer can easily understand that the key frames have essentially the same content. When they were shot by different methods, the image and sound change markedly, so the viewer notices that the content has changed and can concentrate on understanding the video content.

(Relevance 3: Size relationship of the target)
The relevance determination unit 221 may determine the size relationship between key frames from the presence or absence of zoom shooting in the source moving image from which the key frames were extracted and from the area of the target region. Relevance determined in this way is referred to as relevance 3.

Key frames are "in a target size relationship" when the targets contained in a pair of key frames that are consecutive in the video content are the same and the areas of the target regions differ by a prescribed value or more. For example, a target may be introduced by generating video content that combines an image including the target's surroundings with an image showing only the target.

The size relationship of a target can be determined from the area of the partial region common to target regions determined to be the same, or from the distances between feature points contained in that common partial region. For example, the larger the distance between feature points, the larger the target can be judged to appear. The determination may be made between target regions determined to be the same across a pair of key frames consecutive in the video content. In this case, the relevance flag for relevance 3 is set to 1 when the area of the target region in the next key frame is larger than that in the current key frame, to -1 when it is smaller, and to 0 when there is no size relationship.

Alternatively, the determination may be made by comparing the areas of the common partial regions, or the distances between feature points, across the target regions determined to be the same among those detected in all key frames of the video content. For example, based on the maximum area Smax and the minimum area Smin of the common partial regions, a target region is classified as small if its area is smaller than (Smax + 2Smin)/3, as medium if its area is larger than (Smax + 2Smin)/3 and smaller than (2Smax + Smin)/3, and as large if its area is larger than (2Smax + Smin)/3. The relevance flag is then set to 1 when the target regions in consecutive key frames go from small to medium or from medium to large, to 2 when they go from small to large, to -1 when they go from large to medium or from medium to small, to -2 when they go from large to small, and to 0 when there is no size relationship.
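The three-way classification and the resulting flag can be sketched as below. The thresholds follow the text; treating an area exactly equal to a threshold as medium, and computing the flag as the difference of the class ranks, are assumptions of this sketch (the rank difference happens to reproduce every flag value listed above). The function names are hypothetical.

```python
def size_class(area, s_min, s_max):
    """Classify a target region as 0 (small), 1 (medium), or 2 (large)."""
    low = (s_max + 2 * s_min) / 3
    high = (2 * s_max + s_min) / 3
    if area < low:
        return 0
    if area > high:
        return 2
    return 1  # assumption: boundary values count as medium


def size_flag(prev_area, next_area, s_min, s_max):
    """Relevance flag between consecutive key frames: rank difference.

    small->medium or medium->large gives 1, small->large gives 2,
    the reverse directions give -1 and -2, same class gives 0.
    """
    return size_class(next_area, s_min, s_max) - size_class(prev_area, s_min, s_max)


if __name__ == "__main__":
    # With Smin = 10 and Smax = 100 the thresholds are 40 and 70.
    print(size_flag(20, 90, 10, 100))
    print(size_flag(90, 50, 10, 100))
```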

[Rules according to the size relationship of the target]
(3-1) Rules for presentation time
The presentation times of a key frame pair are determined based on the size relationship of the targets contained in the consecutive key frames. For example, within a group of key frames in a target size relationship, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of subsequent key frames are determined with Ts as the reference. Alternatively, the presentation time of a key frame with high visibility in the group may be set to Tp, with the presentation times of subsequent key frames determined with Tp as the reference. The presentation time of the key frame following one whose presentation time has fallen to Tq or less may also be reset to the initial value Ts, with the presentation times of subsequent key frames determined with Ts as the reference. The presentation time of the key frame presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, given a preset presentation time for the entire video content. When the targets contained in a consecutive key frame pair have no size relationship, the presentation time of the subsequent key frame is determined independently of that of the preceding one; for example, it may be set to the initial value Ts or to a random value within a prescribed range.

With reference to FIG. 7, the case of playing back key frames in which target B appears at various sizes is described. Assume that the size relationship between consecutive key frames is determined by comparing the areas of the target regions determined to be the same among those detected in all key frames of the video content. Assume also that the presentation time of the next key frame is calculated by multiplying the presentation time of the current key frame by the parameter a as many times as indicated by the relevance flag. Let the presentation time of the first key frame 701 be the initial value Ts, and let key frames 701 and 702 be in a small-to-medium relationship, key frames 702 and 703 in a medium-to-large relationship, and key frames 703 and 704 in a large-to-small relationship. Because the relevance flag between key frames 701 and 702 is 1, the presentation time of key frame 702 is a × Ts (one multiplication by a). Because the relevance flag for key frame 703 is also 1, the presentation time of key frame 703 is a × a × Ts (another multiplication by a). Because the relevance flag between key frames 703 and 704 is -2, the presentation time of key frame 704 is Ts (division by a × a). When the parameter a is set between 0 and 1, key frames in which target B appears small (long shots) are presented for a long time, and key frames in which target B appears larger (middle shots and tight shots) are presented for a shorter time.
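The FIG. 7 calculation can be written compactly by noting that multiplying by a "flag" times, with a negative flag meaning division, is the same as multiplying by a raised to the power of the flag. The sketch below assumes that reading; the function name and the numeric values of Ts and a are illustrative only.

```python
def times_from_flags(ts, flags, a):
    """Presentation times for a key frame sequence.

    ts:    presentation time of the first key frame
    flags: relevance flag between each consecutive pair (e.g. 1, 1, -2)
    a:     parameter, typically between 0 and 1
    Each step multiplies the previous time by a ** flag.
    """
    times = [ts]
    for flag in flags:
        times.append(times[-1] * a ** flag)
    return times


if __name__ == "__main__":
    # Key frames 701 to 704: small->medium, medium->large, large->small.
    print(times_from_flags(8.0, [1, 1, -2], 0.5))
```

For the flags 1, 1, -2 this yields Ts, a × Ts, a × a × Ts, and Ts again, which matches the values worked out in the text.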

As a result, the user can understand the content of the information-rich key frame in which the scene around object B also appears, and can intuitively understand that the subsequent content is a part of the previous key frame. In addition, even for images that include the same object, a video in which the presentation time of consecutive images varies can be generated, which has the effect of producing video content with a sense of tempo that does not bore the viewer.

(3-2) Rules for Effects, BGM, and Jingles
The effect, BGM, and jingle to be inserted between key frame pairs are determined based on the size relationship of the objects included in consecutive key frame pairs. For example, when the objects included in consecutive key frame pairs are in a size relationship, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame switch. When they are not in a size relationship, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the key frame switch. As another example, when the objects included in consecutive key frame pairs are in a size relationship, the same BGM is played throughout the presentation of the key frame pair; when they are not, the BGM is stopped or switched to a different BGM at the key frame switch.

A jingle may also be inserted between images that are not in a size relationship. In this way, a group of key frames showing objects in a size relationship is connected smoothly, without visual or acoustic change, so the viewer can easily understand that the key frames have nearly the same content. When there is no size relationship, the visual and acoustic change is large, so the viewer notices that the content has changed and can concentrate on understanding the video content.
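The effect, BGM, and jingle rules amount to a two-way choice driven by whether the key frame pair is related. The following is a minimal sketch of that choice; the `Transition` type, the function name, and the concrete effect names are hypothetical, and a real system would draw from its own table of registered effects.

```python
from dataclasses import dataclass


@dataclass
class Transition:
    effect: str     # special effect inserted at the key frame switch
    keep_bgm: bool  # True: continue the same BGM across the switch
    jingle: bool    # True: insert a jingle between the two key frames


def choose_transition(related: bool) -> Transition:
    """Pick a transition for one key frame pair.

    related=True  -> low-change effect, same BGM, no jingle.
    related=False -> high-change effect, BGM stopped or switched, optional jingle.
    """
    if related:
        return Transition(effect="dissolve", keep_bgm=True, jingle=False)
    return Transition(effect="wipe", keep_bgm=False, jingle=True)


print(choose_transition(True))
print(choose_transition(False))
```

Here the jingle is always inserted for unrelated pairs, although the text only says that it may be.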

(Relevance 4. Partial relationship of objects)
The relevance determination unit 221 may determine the relevance based on the partial relationship of the objects shown in the two key frames, that is, according to whether the objects shown in the two key frames of a key frame pair are in a whole-and-part relationship. The relevance determined in this way is referred to as relevance 4.

Being "in a partial relationship of objects" means that the objects shown in consecutive key frame pairs in the target video content are the same, and that the images capture different parts of that object. This applies, for example, when a wide landscape, a large object, or a long object is to be shown: key frames that each capture a part of the object are combined and played back as video content to express the whole.

The relevance flag for relevance 4 is set to 1 when the target region in a key frame and the target region in the next key frame are in a partial relationship, and to 0 when they are not. The partial relationship can be determined based on the partial region (common region) shared by target regions judged to be the same in consecutive key frames in the video content. For example, one target region is used as a template, the other target region is scanned to find the position with the smallest difference, and the overlapping region is taken as the common region. When the regions other than the common region in both target regions are each at least a specified area, the target regions are determined to be in a partial relationship. Alternatively, the determination may be based on the relative positions of target regions judged to be the same across all key frames included in the video content.
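The area test at the end of this determination can be sketched as follows. The sketch assumes that template matching has already been done and that both target regions are expressed as axis-aligned rectangles `(x0, y0, x1, y1)` in one shared coordinate system. It also assumes that a non-empty common region is required, which the text implies but does not state. All names are hypothetical.

```python
def area(rect):
    """Area of a rectangle (x0, y0, x1, y1); zero if it is empty."""
    x0, y0, x1, y1 = rect
    return max(0, x1 - x0) * max(0, y1 - y0)


def common_region(r1, r2):
    """Intersection of two rectangles (may be empty)."""
    return (max(r1[0], r2[0]), max(r1[1], r2[1]),
            min(r1[2], r2[2]), min(r1[3], r2[3]))


def partial_relation_flag(r1, r2, min_area):
    """Return 1 if each region extends beyond the common region by at
    least min_area, otherwise 0."""
    common = area(common_region(r1, r2))
    if common == 0:
        return 0  # assumption: no overlap means no partial relationship
    rest1 = area(r1) - common
    rest2 = area(r2) - common
    return 1 if rest1 >= min_area and rest2 >= min_area else 0


# Two adjacent views of a long object that overlap by 40 pixels in width.
print(partial_relation_flag((0, 0, 100, 50), (60, 0, 160, 50), min_area=1000))  # 1
```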

When changes of the object from whole to part occur consecutively, the relevance is regarded as unchanged, and the presentation method is changed in the same manner.

[Rules according to the partial relationship of objects]
(4-1) Rules for Presentation Time
The presentation time of a key frame pair is determined based on the partial relationship of the objects included in consecutive key frame pairs. For example, among a group of key frames in a partial relationship, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of the subsequent key frames are determined with Ts as the reference. Alternatively, the presentation time of a highly visible key frame in the group may be set to Tp, and the presentation times of the subsequent key frames may be determined with Tp as the reference. Alternatively, the presentation time of the key frame that follows a key frame whose presentation time has fallen to Tq or less may be reset to the initial value Ts, and the presentation times of the subsequent key frames may be determined with Ts as the reference. The presentation time of the image presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, with the presentation time of the entire video content set in advance. When the objects included in consecutive key frame pairs are not in a partial relationship, the subsequent presentation time is determined independently of the presentation time of the previous key frame; for example, it may be set to the initial value Ts or to a random value within a specified range.

A case where key frames showing a landscape are played back will be described with reference to FIG. 8. Assume that the partial relationship between consecutive key frames is determined from the positional relationship between the target regions and the partial region common to target regions judged to be the same among the target regions detected from all key frames included in the video content. Assume also that the presentation time of the next key frame is calculated by multiplying the presentation time of a key frame by a specified parameter.

The presentation time of the first key frame 801 is the initial value Ts. Key frames 801 and 802, and key frames 802 and 803, are in a partial relationship; key frames 803 and 804 are not. Because the relevance flag of key frames 801 and 802 is 1, the presentation time of key frame 802 is a × Ts. Because the relevance flag of key frames 802 and 803 is again 1, the presentation time of key frame 803 is a²Ts. Because the relevance flag of key frames 803 and 804 is 0, the presentation time of key frame 804 is returned to the initial value Ts.

The parameter a is set to a value between 0 and 1 that becomes smaller as the area of the partial region matching between the key frames becomes larger. The key frame 801, presented first for the landscape, is then presented for a long time, and the other parts are presented for a time that depends on the amount of information overlapping with the previously presented image. As a result, the user can understand the content of the key frame presented first for the landscape, and can understand that the subsequent content is nearly equivalent to that of the first key frame. In addition, even for images that include the same object, a video in which the presentation time of consecutive images varies can be generated, which has the effect of producing video content with a sense of tempo that does not bore the viewer.
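The FIG. 8 example combines two ideas: the presentation time is reset to Ts when the relevance flag is 0, and the parameter a shrinks as the overlap grows. A minimal sketch under stated assumptions follows. The linear mapping from overlap ratio to a, the bounds `a_min` and `a_max`, and all names are hypothetical; the text only requires that a lie between 0 and 1 and decrease with the overlapping area.

```python
def decay_parameter(overlap_ratio, a_min=0.5, a_max=0.9):
    """Map an overlap ratio in [0, 1] to a value of a.

    A larger overlap gives a smaller a (assumed linear mapping).
    """
    ratio = min(max(overlap_ratio, 0.0), 1.0)
    return a_max - (a_max - a_min) * ratio


def times_with_overlap(pairs, ts):
    """pairs[i] = (flag, overlap_ratio) between key frame i and i + 1.

    flag == 1: multiply the previous time by a; flag == 0: reset to ts.
    """
    times = [ts]
    for flag, overlap in pairs:
        if flag == 1:
            times.append(times[-1] * decay_parameter(overlap))
        else:
            times.append(ts)
    return times


# Key frames 801-804: two related pairs followed by an unrelated pair.
print(times_with_overlap([(1, 0.3), (1, 0.6), (0, 0.0)], ts=4.0))
```

A pair with more overlap carries less new information, so the later key frame of that pair is shown for a shorter time.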

(4-2) Rules for Effects, BGM, and Jingles
The effect, BGM, and jingle to be inserted between key frame pairs are determined based on the partial relationship of the objects included in consecutive key frame pairs. For example, when the objects included in consecutive key frame pairs are in a partial relationship, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame switch. When they are not in a partial relationship, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the key frame switch. As another example, when consecutive key frame pairs are in a partial relationship, the same BGM is played throughout the presentation of the key frame pair; when they are not, the BGM is stopped or switched to a different BGM at the key frame switch. A jingle may also be inserted between images that are not in a size relationship. In this way, when consecutive key frame pairs are in a partial relationship, they are connected smoothly, without visual or acoustic change, so the viewer can easily understand that the key frames have nearly the same content. When there is no partial relationship, the visual and acoustic change is large, so the viewer notices that the content has changed and can concentrate on understanding the video content.

(Relevance 5. Homogeneity of objects)
The relevance determination unit 221 may determine the relevance according to whether the objects shown in the two key frames are of the same type. The relevance determined in this way is referred to as relevance 5.

"The objects are of the same type" means that the main objects shown in consecutive key frame pairs in the video content belong to the same type. The relevance flag for relevance 5 is set to 1 when the target region in a key frame and the target region in the next key frame are of the same type, and to 0 when they are of different types. The homogeneity of objects can be determined by a method based on machine learning, using image data (registered data) of objects belonging to each type to be distinguished. First, the image features of the objects belonging to each type are extracted from the registered data. Global features such as a color histogram or an edge histogram may be used as the image features, or local features such as HoG or SIFT may be used. Learning may be performed on the global features with an SVM, a neural network, a GMM, or the like, or learning may be performed after the local features are converted into another feature space, as in BoW (Bag of Words). To determine the homogeneity of the target region in each key frame included in the video content, the similarity between the image features of each target region and the model of each type obtained by learning is calculated, and the target region is assigned the type of the closest model whose similarity is at least a specified value. Target regions assigned the same type are determined to be of the same type. The homogeneity may also be determined by a method other than the above.
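The final decision step (assign each target region to the closest model above a threshold, then compare the assigned types) can be sketched as follows. To keep the example self-contained, it swaps in a single prototype feature vector per type and cosine similarity in place of a trained SVM, neural network, or GMM. The feature vectors, the threshold of 0.8, and all names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def classify(feature, models, threshold):
    """Return the type of the most similar model, or None if no model
    reaches the threshold."""
    best_type, best_sim = None, threshold
    for type_name, model in models.items():
        sim = cosine_similarity(feature, model)
        if sim >= best_sim:
            best_type, best_sim = type_name, sim
    return best_type


def same_type_flag(f1, f2, models, threshold=0.8):
    """Relevance flag for relevance 5: 1 if both regions get the same type."""
    t1 = classify(f1, models, threshold)
    t2 = classify(f2, models, threshold)
    return 1 if t1 is not None and t1 == t2 else 0


models = {"flower": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
print(same_type_flag([0.9, 0.1, 0.0], [0.8, 0.0, 0.2], models))  # 1
print(same_type_flag([0.9, 0.1, 0.0], [0.1, 0.9, 0.0], models))  # 0
```

A region that matches no model well enough is left untyped, so it never counts as the same type as anything else.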

When three consecutive images include objects of the same type, the relevance is regarded as unchanged, and the presentation method is changed in the same manner.

[Rules according to the homogeneity of objects]
(5-1) Rules for Presentation Time
The presentation time of a key frame pair is determined based on the homogeneity of the objects included in consecutive key frame pairs. For example, when key frames showing objects of the same type appear consecutively, the presentation time is gradually shortened. That is, among a group of key frames that include objects of the same type, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of the subsequent key frames are determined with Ts as the reference. Alternatively, the presentation time of a highly visible key frame in the group may be set to Tp, and the presentation times of the subsequent key frames may be determined with Tp as the reference. Alternatively, the presentation time of the key frame that follows a key frame whose presentation time has fallen to Tq or less may be reset to the initial value Ts, and the presentation times of the subsequent key frames may be determined with Ts as the reference. The presentation time of the image presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, with the presentation time of the entire video content set in advance. When the objects included in consecutive key frame pairs are not of the same type, the subsequent presentation time is determined independently of the presentation time of the previous key frame; for example, it may be set to the initial value Ts or to a random value within a specified range.

A case where key frames showing flowers are played back will be described with reference to FIG. 9. Assume that the homogeneity between consecutive key frames is determined by a method based on machine learning. Assume also that the presentation time of the next key frame is calculated by multiplying the presentation time of a key frame by the parameter as many times as indicated by the relevance flag. The presentation time of the first key frame 901 is the initial value Ts. Key frames 901 and 902, and key frames 902 and 903, are of the same type; key frames 903 and 904 are of different types. Because the relevance flag of key frames 901 and 902 is 1, the presentation time of key frame 902 is a × Ts. Because the relevance flag of key frames 902 and 903 is also 1, the presentation time of key frame 903 is a²Ts. Because the relevance flag of key frames 903 and 904 is 0, the presentation time of key frame 904 is returned to the initial value Ts. When the parameter a is set between 0 and 1, the key frame 901, presented first among the key frames that include plants, is presented for a long time, and each subsequent key frame is presented for a shorter time the further it is from key frame 901. As a result, the user can understand from the key frame presented first that the image content is a plant, and can understand that the content of the subsequent key frames is nearly equivalent.

In addition, even for images that include the same object, a video in which the presentation time of consecutive images varies can be generated, which has the effect of producing video content with a sense of tempo that does not bore the viewer. When multiple flower images taken in a flower field are played back in sequence as subjects of the same type, it is possible to express that many subjects of this type were present.

(5-2) Rules for Effects, BGM, and Jingles
The effect, BGM, and jingle to be inserted between key frame pairs are determined based on the homogeneity of the objects included in consecutive key frame pairs. For example, when the objects included in consecutive key frame pairs are of the same type, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame switch. When they are of different types, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the key frame switch. As another example, when consecutive key frame pairs are of the same type, the same BGM is played throughout the presentation of the key frame pair; when they are of different types, the BGM is stopped or switched to a different BGM at the key frame switch. A jingle may also be inserted between key frames of different types. In this way, when the objects included in consecutive key frame pairs are of the same type, the key frames are connected smoothly, without visual or acoustic change, so the viewer can easily understand that the key frames have nearly the same content. When they are of different types, the visual and acoustic change is large, so the viewer notices that the content has changed and can concentrate on understanding the video content.

(Relevance 6. Identity of shooting location)
The relevance determination unit 221 may determine the relevance based on whether the two key frames share a shooting location. The relevance determined in this way is referred to as relevance 6.

"The shooting location is the same" means that consecutive key frame pairs in the video content were shot at the same location. The relevance flag for relevance 6 is set to 1 when a key frame and the next key frame have the same shooting location, and to 0 when they have different shooting locations. The identity of the shooting location can be determined based on the similarity of the region other than the target region (background region) in the key frames. For example, the target region and the background region may be separated in each key frame, and the shooting locations may be determined to be the same when the image features extracted from the background regions are similar. The identity of the shooting location may also be determined by a method other than the above. It may be determined from the similarity of the backgrounds of consecutive key frames in the video content, or from the identity of the background regions in all key frames included in the video content. In addition to the image information, meta information such as the shooting location and sensor information such as GPS may be combined for the determination.
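The background-similarity test can be sketched with color histograms. The sketch assumes that the background region has already been separated from the target region and summarized as a histogram of bin counts. Histogram intersection is used as the similarity measure, and both that choice and the threshold of 0.7 are assumptions made for illustration, not taken from the source.

```python
def normalise(hist):
    """Scale bin counts so that they sum to 1."""
    total = sum(hist)
    return [v / total for v in hist] if total else list(hist)


def histogram_intersection(h1, h2):
    """Similarity in [0, 1] between two normalised histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def same_location_flag(bg1, bg2, threshold=0.7):
    """Relevance flag for relevance 6: 1 if the backgrounds are similar."""
    sim = histogram_intersection(normalise(bg1), normalise(bg2))
    return 1 if sim >= threshold else 0


beach_a = [10, 30, 60]  # hypothetical background histograms (bin counts)
beach_b = [12, 28, 60]
forest = [80, 15, 5]
print(same_location_flag(beach_a, beach_b))  # 1
print(same_location_flag(beach_a, forest))   # 0
```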

When three consecutive images were taken at the same shooting location, the relevance is regarded as unchanged, and the presentation method is changed in the same manner.

[Rules according to the identity of the shooting location]
(6-1) Rules for Presentation Time
The presentation time of a key frame pair is determined based on the identity of the shooting locations of consecutive key frame pairs. For example, when key frames shot at the same location appear consecutively, the presentation time is gradually shortened. For example, among a group of key frames shot at the same location, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of the subsequent key frames are determined with Ts as the reference. Alternatively, the presentation time of a highly visible key frame in the group may be set to Tp, and the presentation times of the subsequent key frames may be determined with Tp as the reference. Alternatively, the presentation time of the key frame that follows a key frame whose presentation time has fallen to Tq or less may be reset to the initial value Ts, and the presentation times of the subsequent key frames may be determined with Ts as the reference. The presentation time of the image presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, with the presentation time of the entire video content set in advance. When consecutive key frame pairs were shot at different locations, the subsequent presentation time is determined independently of the presentation time of the previous key frame; for example, it may be set to the initial value Ts or to a random value within a specified range.
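The Tq rule keeps a long run of related key frames from shrinking toward zero: once a presentation time falls to Tq or less, the next key frame starts again from Ts. A minimal sketch, assuming a constant decay factor a within the run; the function name and sample values are hypothetical.

```python
def times_with_floor(n, ts, a, tq):
    """Presentation times for n consecutive related key frames.

    Each time is a times the previous one, except that the key frame
    following a time of tq or less is reset to ts.
    """
    times = []
    current = ts
    for _ in range(n):
        times.append(current)
        current = ts if current <= tq else current * a
    return times


print(times_with_floor(6, ts=4.0, a=0.5, tq=1.0))  # [4.0, 2.0, 1.0, 4.0, 2.0, 1.0]
```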

(6-2) Rules for Effects, BGM, and Jingles
The effect, BGM, and jingle to be inserted between key frame pairs are determined based on the identity of the shooting locations of consecutive key frame pairs. For example, when consecutive key frame pairs were shot at the same location, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted at the key frame switch. When they were shot at different locations, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted at the key frame switch. As another example, when consecutive key frame pairs were shot at the same location, the same BGM is played throughout the presentation of the key frame pair; when they were shot at different locations, the BGM is stopped or switched to a different BGM at the key frame switch. A jingle may also be inserted between key frames shot at different locations. In this way, when consecutive key frame pairs were shot at the same location, they are connected smoothly, without visual or acoustic change, so the viewer can easily understand that the key frames have nearly the same content. When they were shot at different locations, the visual and acoustic change is large, so the viewer notices that the content has changed and can concentrate on understanding the video content.

(Relevance 7. Identity of shooting time zone)
The relevance determination unit 221 may determine the relevance based on whether the two key frames included in a key frame pair share a shooting time zone. The relevance determined in this way is referred to as relevance 7.

"The shooting time zone is the same" means that consecutive key frame pairs in the video content were shot in the same time zone. The relevance flag for relevance 7 is set to 1 when a key frame and the next key frame have the same shooting time zone, and to 0 when they have different shooting time zones. The identity of the shooting time zone can be determined based on the color information of the background region in the key frames. For example, a day is divided into a plurality of time zones, statistics of the color histogram of sunlight in each time zone are held, and when the background region of a key frame contains a partial region close to the statistics of one of the time zones, the key frame is determined to have been shot in that time zone. The shooting time zone of each key frame is estimated in this way, and when the estimated time zones are the same, the shooting time zones are determined to be identical. The identity of the shooting time zone may also be determined by a method other than the above. It may be determined from the similarity of the shooting time zones of consecutive key frames in the video content, or from the identity of the shooting time zones of all key frames included in the video content. In addition to the image information, the shooting time held as meta information may be combined for the determination.
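The time-zone estimate can be sketched as a nearest-statistics lookup. The sketch assumes that each time zone is summarized by one reference color histogram and that each key frame background is summarized by one histogram of the same length. The zone names, the L1 distance, and the `max_distance` cut-off are hypothetical choices rather than details from the source.

```python
def l1_distance(h1, h2):
    """Sum of absolute bin differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def estimate_time_zone(background_hist, zone_stats, max_distance):
    """Return the time zone whose statistics are closest to the
    background histogram, or None if none is close enough."""
    best = min(zone_stats, key=lambda z: l1_distance(background_hist, zone_stats[z]))
    if l1_distance(background_hist, zone_stats[best]) <= max_distance:
        return best
    return None


def same_time_zone_flag(h1, h2, zone_stats, max_distance=0.5):
    """Relevance flag for relevance 7: 1 if both estimates agree."""
    z1 = estimate_time_zone(h1, zone_stats, max_distance)
    z2 = estimate_time_zone(h2, zone_stats, max_distance)
    return 1 if z1 is not None and z1 == z2 else 0


zone_stats = {"morning": [0.2, 0.3, 0.5], "evening": [0.6, 0.3, 0.1]}
print(same_time_zone_flag([0.25, 0.3, 0.45], [0.2, 0.35, 0.45], zone_stats))  # 1
print(same_time_zone_flag([0.25, 0.3, 0.45], [0.55, 0.35, 0.1], zone_stats))  # 0
```

When shooting-time metadata is available, it could replace or confirm the image-based estimate.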

When three consecutive images were taken in the same shooting time zone, the relevance is regarded as unchanged, and the presentation method is changed in the same manner. For example, the presentation time is gradually shortened by the same time interval each time.

[Rules according to the identity of the shooting time period]
(7-1) Rules regarding presentation time
The presentation time of a key frame pair is determined based on whether the consecutive key frames were shot in the same time period. For example, when key frames shot within a certain range of shooting times appear consecutively, the presentation time is shortened gradually. That is, among a group of key frames shot in the same time period, the presentation time of the key frame presented first is set to the initial value Ts, and the presentation times of the subsequent key frames are determined with Ts as the reference. Alternatively, among the key frames shot in the same time period, the presentation time of a key frame with high visibility may be set to Tp, and the presentation times of the subsequent key frames may be determined with Tp as the reference. Alternatively, once the presentation time of a key frame in the group has fallen to Tq or less, the presentation time of the next key frame may be set to the initial value Ts, and the presentation times of the subsequent key frames may be determined with Ts as the reference. The presentation time of the image presented last in the group may also be set to Ts. The values of Ts and Tp may be calculated from the number of images to be presented, with the presentation time of the entire video content set in advance. When the consecutive key frames of a pair were shot in different time periods, the presentation time of the later key frame is determined independently of that of the earlier key frame. For example, it may be set to the initial value Ts or to a random value within a specified range. When the source moving image is a long uncut moving image, key frames can be determined to be related even if the shooting time periods of the portions from which they were extracted differ greatly.
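One possible reading of rule (7-1) is sketched below. The fixed decrement `step`, the lower bound `t_min`, and the function name are assumptions introduced for illustration; the text only states that the presentation time is shortened gradually from Ts, that it may be set to Ts again after falling to Tq or less, and that it is decided independently when the time period changes (here, by resetting to Ts).

```python
from typing import List, Optional


def presentation_times(flags: List[int], ts: float, step: float,
                       tq: Optional[float] = None,
                       t_min: float = 0.0) -> List[float]:
    """Return one presentation time per key frame.

    flags[i] is 1 when key frames i and i+1 share a shooting time period,
    so len(result) == len(flags) + 1.
    """
    times = [ts]  # the first key frame is presented for the initial value Ts
    for flag in flags:
        prev = times[-1]
        if flag == 0:
            # Different time period: decided independently of the previous
            # key frame (this sketch simply resets to Ts).
            times.append(ts)
        elif tq is not None and prev <= tq:
            # Tq variant: after the time has fallen to Tq or less,
            # the next key frame starts again from Ts.
            times.append(ts)
        else:
            # Same time period: shorten by an assumed fixed step.
            times.append(max(prev - step, t_min))
    return times
```

For example, with `ts=4.0` and `step=1.0`, three key frames from one time period followed by two from another yield times that decrease within each group and restart from Ts at the boundary.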

(7-2) Rules regarding effects, BGM, and jingles
The effect, BGM, and jingle to be inserted between the key frames of a pair are determined based on whether the consecutive key frames were shot in the same time period. For example, when the consecutive key frames were shot in the same time period, a special effect registered in advance as an effect with little visual change (such as a dissolve or fade) is inserted when the key frames are switched. When they were shot in different time periods, a special effect registered in advance as an effect with a large visual change (a DVE such as a page turn or wipe) is inserted when the key frames are switched. As another example, when the consecutive key frames were shot in the same time period, the same BGM is played while the key frame pair is presented, and when they were shot in different time periods, the BGM is stopped or switched to a different BGM when the key frames are switched. A jingle may also be inserted between key frames from different time periods. As a result, consecutive key frames shot in the same time period are connected smoothly, without a change in image or sound, so the viewer can easily understand that the key frames show almost the same content. When the key frames were shot in different time periods, the image and sound change greatly, so the viewer notices that the content has changed and can concentrate on understanding the video content.

As the presentation rule, any one of the above rules may be applied, or a plurality of rules may be used in combination. The video content generation unit 203 generates video content based on the presentation method information input from the presentation method determination unit 202 and the image information input from the video input unit 204.
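Rule (7-2) can be sketched in the same style. The effect names, the `Transition` record, the round-robin choice of the next BGM track, and the decision to always insert a jingle at a time period boundary are all assumptions of this example. The text only requires a low-change effect and unchanged BGM within a time period, and a high-change effect with a BGM stop or switch (and optionally a jingle) across time periods.

```python
from dataclasses import dataclass
from typing import List

# Effects assumed to be registered in advance, grouped by visual change.
LOW_CHANGE_EFFECTS = ("dissolve", "fade")
HIGH_CHANGE_EFFECTS = ("page_turn", "wipe")


@dataclass
class Transition:
    effect: str      # effect applied when switching key frames
    bgm_track: int   # index of the BGM played for the following key frame
    jingle: bool     # whether a jingle is inserted at the switch


def plan_transitions(flags: List[int], num_tracks: int = 2) -> List[Transition]:
    """Plan one transition per consecutive key frame pair.

    flags[i] is 1 when key frames i and i+1 share a shooting time period.
    """
    track = 0
    plan: List[Transition] = []
    for flag in flags:
        if flag == 1:
            # Same time period: smooth effect, keep the same BGM, no jingle.
            plan.append(Transition(LOW_CHANGE_EFFECTS[0], track, False))
        else:
            # Different time period: conspicuous effect, switch BGM, jingle.
            track = (track + 1) % num_tracks
            plan.append(Transition(HIGH_CHANGE_EFFECTS[1], track, True))
    return plan
```

The flags produced for relevance 7 can be passed directly to `plan_transitions`, so the same determination drives both the presentation times and the transitions.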

[Other Embodiments]
The embodiments of the present invention have been described in detail above. A system or apparatus that combines, in any way, the separate features included in the respective embodiments is also included in the scope of the present invention.

In addition, the present invention may be applied to a system composed of a plurality of devices, or to a single apparatus. Furthermore, the present invention is also applicable when an information processing program that implements the functions of the embodiments is supplied to a system or apparatus directly or remotely. Therefore, a program installed in a computer in order to realize the functions of the present invention on the computer, a medium storing the program, and a WWW (World Wide Web) server from which the program is downloaded are also included in the scope of the present invention.

Claims

An information processing apparatus comprising:
extraction means for extracting at least two partial moving images or still images from a source moving image;
determining means for determining, based on a feature of the source moving image, a method of presenting the at least two partial moving images or still images extracted by the extraction means; and
generating means for generating video content including the at least two partial moving images or still images, based on the presentation method determined by the determining means.

The information processing apparatus according to claim 1, wherein the generating means generates the video content so as to present the at least two still images in succession.

The information processing apparatus according to claim 1, wherein the determining means determines presentation times of the at least two partial moving images or still images in the video content.

The information processing apparatus, wherein, when the at least two partial moving images or still images are related to each other, the determining means determines the presentation time of one based on the presentation time of the other.

The information processing apparatus according to claim 4, wherein, when the objects included in the at least two partial moving images or still images are the same, the determining means makes the presentation time of the partial moving image or still image inserted later shorter than the presentation time of the partial moving image or still image inserted earlier.

The information processing apparatus according to any one of claims 3 to 5, wherein, when the at least two partial moving images or still images are not related to each other, the determining means determines their presentation times independently.

The information processing apparatus according to claim 1, wherein the determining means determines an effect or a jingle to be used when switching between the at least two partial moving images or still images in the video content.

The information processing apparatus, wherein, when the at least two partial moving images or still images are related to each other, the determining means determines an effect or a jingle different from that determined when they are not related.

The information processing apparatus according to claim 1, wherein the determining means determines background music for the at least two partial moving images or still images in the video content.

The information processing apparatus according to any one of claims 1 to 9, wherein the determining means includes:
determination means for determining, based on the source moving image, whether or not the objects included in the at least two partial moving images or still images are related; and
selection means for selecting, when the objects are related, a presentation method different from that selected when the objects are not related.

The information processing apparatus according to claim 10, wherein the determination means determines, based on the source moving image, whether or not the objects included in the at least two partial moving images or still images are the same.

The information processing apparatus according to claim 10 or 11, wherein the determination means determines, based on the source moving image, whether or not the method of shooting the objects in the at least two partial moving images or still images is the same.

The information processing apparatus according to claim 10, wherein the determination means determines, based on an acoustic feature of the source moving image, whether or not the objects included in the at least two partial moving images or still images have commonality.

An information processing method comprising:
an extraction step of extracting at least two partial moving images or still images from a source moving image;
a determination step of determining, based on a feature of the source moving image, a method of presenting the at least two partial moving images or still images extracted in the extraction step; and
a generation step of generating video content including the at least two partial moving images or still images, based on the presentation method determined in the determination step.

An information processing program for causing a computer to execute:
an extraction step of extracting at least two partial moving images or still images from a source moving image;
a determination step of determining, based on a feature of the source moving image, a method of presenting the at least two partial moving images or still images extracted in the extraction step; and
a generation step of generating video content including the at least two partial moving images or still images, based on the presentation method determined in the determination step.