JP2018007134A - Scene extraction device and its program

Info

Publication number: JP2018007134A
Application number: JP2016134108A
Authority: JP
Inventors: Tomomi Hasegawa; Norifumi Oide; Tomoyasu Komori; Yasushige Nakayama; Yoichi Suzuki; Shuichi Sakamoto; Kenji Ozawa; Yuichiro Kinoshita
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2016-07-06
Filing date: 2016-07-06
Publication date: 2018-01-11
Anticipated expiration: 2036-07-06
Also published as: JP6688179B2

Abstract

[Problem] To provide a scene extraction device that extracts scenes with a high sense of presence from content. [Solution] A scene extraction device 1 includes: presence estimation means 10 that, for each time section, analyzes the video and acoustic feature amounts of the content and calculates a presence estimate from presence learning data in which the degree of presence for the feature amounts of the signals constituting the content has been learned in advance; determination means 50 that determines, for each time section, whether the sense of presence is high based on a preset threshold; and extraction means 60 that extracts from the content the scene corresponding to each time section determined to have a high sense of presence. [Selected drawing] FIG. 1
Kind Code: A1

Description

The present invention relates to a scene extraction device that extracts scenes from audiovisual content based on the subjective impressions of viewers, and to a program therefor.

Conventionally, there are techniques for extracting particular subjects from audiovisual content to produce summary videos, digest videos, and the like.
Such techniques include methods that extract scenes based on video feature amounts, for example by using motion vectors, which are feature amounts of motion in the video, to extract specific scenes, or by using face recognition to extract scenes in which a specific person appears (see, for example, Patent Documents 1 and 2).
Another conventional technique detects speech from the acoustic features of the audio and extracts the scenes that contain speech (see, for example, Patent Document 3).
Thus, techniques for extracting partial scenes from audiovisual content, such as summary or digest videos, generally extract concrete subjects objectively from video or audio feature amounts.

Patent Document 1: JP 2012-10265 A
Patent Document 2: JP 2008-287594 A
Patent Document 3: JP 2014-33417 A

Because conventional techniques extract concrete subjects from audiovisual content using only objective features, they cannot extract scenes with a high sense of presence, which is an abstract impression.
Here, the sense of presence is the feeling of actually being at the scene. For example, Inoue, "Evaluation of the sense of presence that people feel in ultra-realistic communication," The Institute of Electronics, Information and Communication Engineers, IEICE Technical Report, CQ2008-47, pp. 7-12, 2008, describes the sense of presence as composed of spatial elements such as texture, temporal elements such as simultaneity, and bodily elements such as emotion, and as something a person feels under the influence of external factors such as vision and hearing and internal factors such as past experience.
In other words, scenes with a high sense of presence cannot be extracted from the objective features of audiovisual content; they must be extracted based on people's subjective impressions.

The present invention has been made in view of this problem, and its object is to provide a scene extraction device, and a program therefor, capable of extracting scenes with a high sense of presence from audiovisual content using subjective impressions as the criterion.

To solve the above problem, the scene extraction device according to the present invention extracts scenes with a high sense of presence from content, where the sense of presence is defined by learning predetermined feature amounts of content. The device comprises presence learning data storage means, setting information storage means, presence estimation means, presence estimate storage means, determination means, determination result storage means, and extraction means.

In this configuration, the scene extraction device stores, in the presence learning data storage means, presence learning data in which the degree of presence for the feature amounts of the signals constituting the content, such as video signals and acoustic signals, has been learned in advance. This presence learning data is data learned by, for example, a neural network or another machine learning method. The scene extraction device also stores in advance, in the setting information storage means, setting information that includes at least the time interval and time width of the signal used to analyze the sense of presence, and a threshold serving as the criterion for determining whether the sense of presence is high.

The scene extraction device then uses the presence estimation means to analyze the feature amounts of the content and to calculate the degree of presence as a presence estimate from the presence learning data. The presence estimation means calculates the presence estimate of the content for each time section of a predetermined time width, where successive time sections are shifted by a predetermined time interval so that they overlap, and stores each estimate in the presence estimate storage means in association with its time section. Because the time sections overlap, the presence estimation means can estimate the sense of presence at the predetermined time interval while still covering a span long enough to calculate each presence estimate.
In this way, the scene extraction device obtains, as a time series of presence estimates per time section, the state of presence that a person perceives as a subjective impression and that changes as the content progresses.

The scene extraction device then uses the determination means to determine, based on the preset threshold, whether the scene in each time section associated with a presence estimate has a high sense of presence, and stores the determination result in the determination result storage means. The threshold may be an absolute threshold, that is, a presence estimate value that itself serves as the criterion, or a relative threshold that specifies what proportion of scenes to extract, starting from those with the largest presence estimates.
In this way, the scene extraction device can determine whether the sense of presence is high in each time section of the time series.
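The two threshold types can be illustrated with a minimal sketch. The function name `judge_presence`, the use of `>=` for the absolute comparison, and the rounding of the relative proportion are assumptions made for illustration; the source does not specify them.

```python
def judge_presence(estimates, threshold, relative=False):
    """Return one True/False flag per time section.

    Absolute mode: a section is "high" if its estimate is >= threshold (assumed comparison).
    Relative mode: the top `threshold` fraction of sections, ranked by estimate, is "high".
    """
    if not relative:
        return [value >= threshold for value in estimates]
    count = round(len(estimates) * threshold)  # assumed rounding rule
    ranked = sorted(range(len(estimates)), key=lambda i: estimates[i], reverse=True)
    selected = set(ranked[:count])
    return [i in selected for i in range(len(estimates))]


estimates = [0.2, 0.8, 0.9, 0.4, 0.7]
absolute_flags = judge_presence(estimates, 0.7)
relative_flags = judge_presence(estimates, 0.4, relative=True)
```

With an absolute threshold the number of flagged sections depends on the content, whereas a relative threshold fixes the share of the content that is extracted.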

The scene extraction device then uses the extraction means to extract from the content the scenes corresponding to the time sections recorded in the determination result storage means as having a high sense of presence. Specifically, for each run of consecutive time sections determined to have a high sense of presence, the extraction means extracts from the content the scene spanning from the start time of the first time section to the end time of the last time section.
In this way, the scene extraction device can extract scenes with a high sense of presence from the content based on the viewer's subjective impression.
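The merging of consecutive high-presence time sections into one scene might look like the following sketch. The function name `extract_scenes` and the representation of a section by its index (so that section `k` starts at `k * interval` seconds and lasts `width` seconds) are assumptions for illustration.

```python
def extract_scenes(flags, interval, width):
    """Merge runs of consecutive high-presence sections into (start, end) times in seconds."""
    scenes = []
    run_start = None
    for index, flag in enumerate(flags):
        if flag and run_start is None:
            run_start = index  # first section of a new run
        if run_start is not None and (not flag or index == len(flags) - 1):
            last = index if flag else index - 1  # last section of the run
            scenes.append((run_start * interval, last * interval + width))
            run_start = None
    return scenes


scenes = extract_scenes([False, True, True, False, True], interval=0.5, width=3.0)
```

Because the time sections overlap, two scenes separated by a single low-presence section can overlap in time, as they do in this example; how such cases are handled is not stated in this part of the description.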

The scene extraction device can also be realized by running, on a computer, a scene extraction program that causes the computer to function as each of the means described above.

The present invention provides the following advantageous effects.
According to the present invention, scenes with a high sense of presence can be extracted from content based on the subjective impression a person has of the content, rather than on objective features of the content as in conventional techniques.
Furthermore, because the presence estimate is calculated from the signals constituting the content using presence learning data learned in advance, scenes with a high sense of presence can be extracted from the content automatically and with high accuracy.

FIG. 1 is a block diagram showing the configuration of the scene extraction device according to the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the auditory presence estimation means of FIG. 1.
FIG. 3 is a block diagram showing the configuration of the visual presence estimation means of FIG. 1.
FIG. 4 is a block diagram showing the configuration of the sensory presence estimation means of FIG. 1.
FIG. 5 is an explanatory diagram of the time sections over which the sense of presence is estimated.
FIG. 6 is a data structure diagram showing an example of the structure of the data stored in the presence estimate/determination result storage means of FIG. 1.
FIG. 7 is a flowchart showing the setting operation performed by the setting device on the scene extraction device according to the first embodiment of the present invention.
FIG. 8 is a flowchart showing the operation of the scene extraction device according to the first embodiment of the present invention.
FIG. 9 is a flowchart showing details of the presence estimation operation of FIG. 8.
FIG. 10 is a flowchart showing details of the presence determination operation of FIG. 8.
FIG. 11 is a block diagram showing the configuration of the scene extraction device according to the second embodiment of the present invention.
FIG. 12 is a data structure diagram showing an example of the structure of the data stored in the presence estimate/determination result storage means of FIG. 11.

Embodiments of the present invention are described below with reference to the drawings.
<< First Embodiment >>
[Configuration of scene extraction device]
First, the configuration of the scene extraction device 1 according to the first embodiment of the present invention is described with reference to FIG. 1.

The scene extraction device 1 extracts scenes with a high sense of presence from audiovisual content (hereinafter, content) C and outputs the extracted scenes to the display device 2.
In addition to a video signal and an acoustic signal, the content C may include a time-series sensory signal. The sensory signal can be, for example, a vibration signal whose amplitude represents the magnitude of the vibration to be reproduced in a chair in a home theater, movie theater, or the like. Alternatively, the sensory signal may be a time-series angle signal representing the tilt of the chair.
Here, the scene extraction device 1 is assumed to be connected to an external setting device 3, such as a PC (personal computer), through which various setting information is set in advance.
As shown in FIG. 1, the scene extraction device 1 includes presence estimation means 10, presence learning data storage means 20, presence estimate/determination result storage means 30, setting information storage means 40, determination means 50, and extraction means 60.

The presence estimation means 10 reads the time interval and time width from the setting information storage means 40 and calculates, per unit of time, a presence estimate indicating the degree of presence of the content C. That is, for each time section of the predetermined time width, with successive time sections shifted by the predetermined time interval so that they overlap, the presence estimation means 10 analyzes the feature amounts of the content C and calculates the degree of presence as a presence estimate from the presence learning data stored in the presence learning data storage means 20.
Here, the presence estimation means 10 includes auditory presence estimation means 11, visual presence estimation means 12, sensory presence estimation means 13, and presence specifying means 14.

The auditory presence estimation means 11 calculates a presence estimate from the acoustic signal of the content C. It can be configured as shown in FIG. 2.
As shown in FIG. 2, the auditory presence estimation means 11 includes acoustic signal analysis means 110 and auditory presence calculation means 111.

The acoustic signal analysis means 110 analyzes the acoustic signal of the content C over a predetermined time width at each predetermined time interval and extracts acoustic feature amounts. The time interval here is the spacing between the start time of one time section and the start time of the next, in the time series of time sections from which acoustic feature amounts are extracted. The time width here is the length of each time section from which acoustic feature amounts are extracted.
For example, as shown in FIG. 5, the acoustic signal analysis means 110 successively analyzes the acoustic signal using a time interval of 0.5 seconds and a time width of 3 seconds.
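The relationship between time interval and time width can be sketched as follows. The function name `time_sections` and the choice to drop a trailing section that would run past the end of the content are assumptions; the source does not say how the end of the content is handled.

```python
def time_sections(duration, interval=0.5, width=3.0):
    """Return (start, end) pairs in seconds for overlapping sections that fit in the content."""
    sections = []
    index = 0
    while index * interval + width <= duration:
        start = index * interval
        sections.append((start, start + width))
        index += 1
    return sections


sections = time_sections(5.0)
```

With the example values, consecutive sections share 2.5 seconds (the width minus the interval), so a new estimate is available every 0.5 seconds even though each one is computed from 3 seconds of signal.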

As acoustic feature amounts, the acoustic signal analysis means 110 obtains, for example, loudness, sharpness, roughness, dynamic range (the level of the 5% time-rate sound pressure level relative to the 95% time-rate sound pressure level), and sound image movement (expressed as the presence or absence of movement [1, 0], or as changes in elevation and horizontal angle). If the acoustic signal has two channels, left and right, the acoustic signal analysis means 110 may also use the interaural cross-correlation, interaural level difference, and interaural phase difference as acoustic feature amounts.
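As one illustration of these features, the dynamic range could be computed from a list of short-term sound pressure levels in dB. This is a minimal sketch: the function names, the nearest-rank percentile rule, and the assumption that short-term levels are already available are not taken from the source. The N% time-rate level is taken here in its usual acoustics sense, the level exceeded for N% of the time.

```python
import math


def exceedance_level(levels_db, percent):
    """Level exceeded for roughly `percent` % of the frames (nearest-rank rule, assumed)."""
    ordered = sorted(levels_db, reverse=True)
    rank = max(1, math.ceil(len(ordered) * percent / 100))
    return ordered[rank - 1]


def dynamic_range(levels_db):
    """5% time-rate level relative to the 95% time-rate level, in dB."""
    return exceedance_level(levels_db, 5) - exceedance_level(levels_db, 95)


levels = list(range(40, 80))  # hypothetical short-term levels, 40 to 79 dB
result = dynamic_range(levels)
```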

These acoustic analysis techniques are in general use, so their description is omitted. The acoustic feature amounts obtained by the acoustic signal analysis means 110 are of course not limited to the examples given, as long as they can be obtained by acoustic analysis.
However, the acoustic feature amounts obtained by the acoustic signal analysis means 110 must be the same feature amounts that were used when learning the presence learning data stored in the presence learning data storage means 20, described later. The acoustic signal analysis means 110 sequentially outputs the acoustic feature amounts obtained at the predetermined time interval and time width to the auditory presence calculation means 111.

The auditory presence calculation means 111 calculates an auditory presence estimate (an individual presence estimate) from the acoustic feature amounts analyzed by the acoustic signal analysis means 110, based on the presence learning data stored in the presence learning data storage means 20.
Here, the auditory presence calculation means 111 calculates the auditory presence estimate from the input acoustic feature amounts of the predetermined time width, using a neural network function, learned in advance as the presence learning data, that maps a plurality of acoustic feature amounts to an auditory presence estimate. The auditory presence estimate is a normalized value (for example, in the range 0 to 1).
The auditory presence calculation means 111 outputs the calculated auditory presence estimate to the presence specifying means 14.
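A neural network function of this kind can be sketched as a small feedforward network whose sigmoid output stays between 0 and 1. The architecture (one hidden layer with tanh units), the function names, and all weight values below are hypothetical; the source does not disclose the network structure or the learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def estimate_presence(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass: feature amounts -> hidden layer (tanh) -> sigmoid output in (0, 1)."""
    hidden = [
        math.tanh(sum(w * f for w, f in zip(row, features)) + bias)
        for row, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical parameters for three feature amounts and two hidden units.
estimate = estimate_presence(
    features=[0.7, 0.2, 0.5],
    hidden_weights=[[0.4, -0.1, 0.3], [-0.2, 0.5, 0.1]],
    hidden_biases=[0.0, 0.1],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
```

In this sketch the learned parameters play the role of the presence learning data: the same function is reused for every time section, with only the feature amounts changing.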
Returning to FIG. 1, the description of the configuration of the scene extraction device 1 will be continued.

The visual presence estimation means 12 calculates a visual presence estimate from the video signal of the content C. It can be configured as shown in FIG. 3.
As shown in FIG. 3, the visual presence estimation means 12 includes video signal analysis means 120 and visual presence calculation means 121.

The video signal analysis means 120 analyzes the video signal of the content C over a predetermined time width at each predetermined time interval and extracts video feature amounts.
The time interval at which the video signal is analyzed is the same as that at which the acoustic signal analysis means 110 analyzes the acoustic signal.
As video feature amounts, the video signal analysis means 120 obtains, for the frames within the predetermined time width, for example, luminance features (mean, standard deviation, and skewness of luminance), saturation features (mean, standard deviation, and skewness of saturation), hue features (the average number of pixels for each predetermined hue value, such as red, yellow, and blue), and moving object features (the average number of pixels occupied by moving objects, and the difference between the 95th and 5th percentile values of the number of pixels occupied by moving objects).
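The luminance features could be computed as in the following sketch. It assumes that the statistics are pooled over all pixels of all frames in the time section and that the population (not sample) standard deviation is used; the source specifies neither, and the function name `luminance_features` is illustrative.

```python
import statistics


def luminance_features(frames):
    """frames: list of frames, each a flat list of luminance values, for one time section."""
    values = [v for frame in frames for v in frame]
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Skewness as the third standardized moment; defined as 0 for a constant signal.
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / (len(values) * std ** 3)
    return {"mean": mean, "std": std, "skewness": skew}


symmetric = luminance_features([[1, 2, 3]])
bright_spot = luminance_features([[0, 0, 0], [0, 0, 6]])
```

The same three statistics apply unchanged to saturation values, so one helper can serve both feature groups.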

These video analysis techniques are in general use, so their description is omitted. The video feature amounts obtained by the video signal analysis means 120 are of course not limited to the examples given, as long as they can be obtained by video analysis.
However, the video feature amounts obtained by the video signal analysis means 120 must be the same feature amounts that were used when learning the presence learning data stored in the presence learning data storage means 20, described later.
The video signal analysis means 120 sequentially outputs the video feature amounts obtained at the predetermined time interval and time width to the visual presence calculation means 121.

The visual presence calculation means 121 calculates a visual presence estimate (an individual presence estimate) from the video feature amounts analyzed by the video signal analysis means 120, based on the presence learning data stored in the presence learning data storage means 20.
Here, the visual presence calculation means 121 calculates the visual presence estimate from the input video feature amounts of the predetermined time width, using a neural network function, learned in advance as the presence learning data, that maps a plurality of video feature amounts to a visual presence estimate. The visual presence estimate is a normalized value (for example, in the range 0 to 1).
The visual presence calculation means 121 outputs the calculated visual presence estimate to the presence specifying means 14.
Returning to FIG. 1, the description of the configuration of the scene extraction device 1 will be continued.

The sensory presence estimation means 13 calculates a sensory presence estimate from the sensory signal of the content C. It can be configured as shown in FIG. 4.
As shown in FIG. 4, the sensory presence estimation means 13 includes sensory signal analysis means 130 and sensory presence calculation means 131.

The sensory signal analysis means 130 analyzes the sensory signal of the content C over a predetermined time width at each predetermined time interval and extracts sensory feature amounts.
The time interval at which the sensory signal is analyzed is the same as that at which the acoustic signal analysis means 110 analyzes the acoustic signal.
When a vibration signal is used as the sensory signal, the sensory signal analysis means 130 obtains, for the predetermined time width, for example, amplitude features (mean and standard deviation of the amplitude, and the difference between the maximum and minimum amplitudes) and period information (mean and standard deviation of the period, and the difference between the maximum and minimum periods).
When an angle signal is used as the sensory signal, the sensory signal analysis means 130 obtains, as angular displacement features for the predetermined time width, for example, the mean and standard deviation of the angle from the horizontal and the maximum angular displacement.
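The period information for a vibration signal could be obtained as in the sketch below. Measuring a period as the time between successive upward zero crossings is an assumption made for illustration, as are the function name `period_features` and the return of `None` when fewer than two crossings are found; the source does not describe how periods are measured.

```python
import statistics


def period_features(samples, sample_rate):
    """Mean, standard deviation and max-min difference of periods in one time section."""
    crossings = [
        i / sample_rate
        for i in range(1, len(samples))
        if samples[i - 1] < 0 <= samples[i]  # upward zero crossing
    ]
    periods = [later - earlier for earlier, later in zip(crossings, crossings[1:])]
    if not periods:
        return None  # no complete period in this section
    return {
        "mean": statistics.fmean(periods),
        "std": statistics.pstdev(periods),
        "range": max(periods) - min(periods),
    }


features = period_features([-1, 1, -1, 1, -1, 1], sample_rate=2)
```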

These analysis techniques are in general use, so their description is omitted. The sensory feature amounts obtained by the sensory signal analysis means 130 are of course not limited to the examples given, as long as they can be analyzed as sensations.
However, the sensory feature amounts obtained by the sensory signal analysis means 130 must be the same feature amounts that were used when learning the presence learning data stored in the presence learning data storage means 20, described later.
The sensory signal analysis means 130 sequentially outputs the sensory feature amounts obtained at the predetermined time interval and time width to the sensory presence calculation means 131.

The sensory presence calculation means 131 calculates a sensory presence estimate (an individual presence estimate) from the sensory feature amounts analyzed by the sensory signal analysis means 130, based on the presence learning data stored in the presence learning data storage means 20.
Here, the sensory presence calculation means 131 calculates the sensory presence estimate from the input sensory feature amounts of the predetermined time width, using a neural network function, learned in advance as the presence learning data, that maps a plurality of sensory feature amounts to a sensory presence estimate. The sensory presence estimate is a normalized value (for example, in the range 0 to 1).
The sensory presence calculation means 131 outputs the calculated sensory presence estimate to the presence specifying means 14.

The presence specifying means 14 specifies, from the individual presence estimates for a time section estimated by the auditory presence estimation means 11, the visual presence estimation means 12, and the sensory presence estimation means 13, the presence estimate that represents that time section (the representative value).
Here, the presence specifying means 14 takes the maximum of the individual presence estimates (the auditory, visual, and sensory presence estimates) as the presence estimate for the time section. However, the presence specifying means 14 need not use the maximum; it may instead specify the presence estimate using another statistic, such as the mean or a weighted sum.
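Choosing the representative value from the individual estimates can be sketched as below. The function name `representative_estimate`, the dictionary keys, and the example values are hypothetical; only the maximum and the mean are shown.

```python
def representative_estimate(individual, statistic="max"):
    """Pick one presence estimate per time section from the individual estimates."""
    values = list(individual.values())
    if statistic == "max":
        return max(values)
    if statistic == "mean":
        return sum(values) / len(values)
    raise ValueError(f"unsupported statistic: {statistic}")


individual = {"auditory": 0.6, "visual": 0.9, "sensory": 0.3}
by_max = representative_estimate(individual)
by_mean = representative_estimate(individual, "mean")
```

Taking the maximum flags a section as soon as any one modality is strongly immersive, while the mean requires the modalities to agree.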

When the presence estimate is specified by a weighted sum, the presence specifying means 14 reads the weighting coefficients for the individual presence estimates (the auditory, visual, and sensory presence estimates), which are stored in advance in the setting information storage means 40. Alternatively, information classifying the content may be attached to the content C in advance as a tag, and the presence specifying means 14 may use, as the weighting coefficients, values predetermined for each classification indicated by the tag. For example, if the content C is classified as a "music program", the weight of the auditory presence estimate is increased, and if the content C is classified as a "travel program", the weight of the visual presence estimate is increased.
The presence specifying means 14 sequentially writes each specified presence estimate to the presence estimate/determination result storage means 30 in association with the elapsed time from the start of the content C.
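Tag-dependent weighting might be organized as a lookup table, as in this sketch. All weight values, the fallback to equal weights for unknown tags, and the division by the weight total (which keeps the result in the same 0 to 1 range as the inputs) are assumptions; the source gives no numeric weights.

```python
# Hypothetical weight tables; the source gives no numeric values.
DEFAULT_WEIGHTS = {"auditory": 1.0, "visual": 1.0, "sensory": 1.0}
GENRE_WEIGHTS = {
    "music program": {"auditory": 2.0, "visual": 1.0, "sensory": 1.0},
    "travel program": {"auditory": 1.0, "visual": 2.0, "sensory": 1.0},
}


def weighted_estimate(individual, genre_tag=None):
    """Weighted sum of individual estimates, normalized by the total weight (assumed)."""
    weights = GENRE_WEIGHTS.get(genre_tag, DEFAULT_WEIGHTS)
    total = sum(weights[name] for name in individual)
    return sum(weights[name] * value for name, value in individual.items()) / total


individual = {"auditory": 0.8, "visual": 0.4, "sensory": 0.4}
music = weighted_estimate(individual, "music program")
travel = weighted_estimate(individual, "travel program")
```

For the same individual estimates, the audio-heavy section scores higher when the content is tagged as a music program than when it is tagged as a travel program.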

The presence learning data storage means 20 stores learning data (presence learning data) in which the degree of presence for the feature amounts of the signals constituting the content C (the acoustic signal, video signal, and sensory signal) has been learned in advance. The presence learning data storage means 20 can be implemented with a general recording medium such as a hard disk.
The presence learning data is obtained by learning in advance the results of subjective evaluations made by multiple subjects on multiple pieces of content, and is defined, by a neural network or the like, as a model that estimates the sense of presence from multiple feature amounts.

For example, the presence learning data is learned as follows. The acoustic signal, the video signal, and the sensory signal contained in evaluation content are each presented to subjects, who rate the sense of presence for each signal over time on a multi-level scale (for example, seven levels) ranging from "not felt at all" to "felt very strongly". A neural network is then trained on the feature quantities of each signal for every predetermined time width (for example, 0.5 seconds) and the mean presence level reported by all subjects in that time width. Here, the output of the neural network is a normalized value (for example, in the range 0 to 1).
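The training target for one time window can be sketched as follows. This is a minimal illustration, assuming ratings are coded as integers from 1 to 7 and scaled linearly; the source says only that the subjects' mean level is used and that the network output is normalized, so the function name `training_target` and the scaling rule are assumptions.

```python
def training_target(ratings, levels=7):
    """Average the subjects' ratings for one time window and scale to 0..1.

    ratings: integer levels from 1 ("not felt at all") to `levels`
    ("felt very strongly"). The linear scaling is an assumption.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    mean = sum(ratings) / len(ratings)
    return (mean - 1) / (levels - 1)
```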

The feature quantities of the acoustic signal used in learning the presence learning data are the same as the acoustic feature quantities analyzed by the acoustic signal analysis means 110 (see FIG. 2). Likewise, the feature quantities of the video signal are the same as the video feature quantities analyzed by the video signal analysis means 120 (see FIG. 3), and the feature quantities of the sensory signal are the same as the sensory feature quantities analyzed by the sensory signal analysis means 130 (see FIG. 4).
Learning the presence learning data with a neural network in this way makes it possible to build a model that estimates the sense of presence from a plurality of feature quantities. The presence learning data is of course not limited to a neural network; such a model can also be built with other common machine learning methods.

The presence estimate / determination result storage means 30 (presence estimate storage means, determination result storage means) stores the presence estimate for each time interval of the content C, as estimated by the presence estimation means 10, in association with the elapsed time from the beginning of the content C.
The presence estimate / determination result storage means 30 also has an area for recording the result (for example, a flag) of the determination made by the determination means 50 as to whether the presence estimate of each time interval gives the viewer a sense of presence.

For example, as shown in FIG. 6, the presence estimate / determination result storage means 30 has areas for recording the time (elapsed time from the beginning), the presence estimate, and the determination result (overall determination result, threshold determination result, and rank). The time and the presence estimate are written by the presence estimation means 10, and the determination result is written by the determination means 50. The overall determination result, the threshold determination result, and the rank are explained in detail in the description of the determination means 50.
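One way to picture a row of the table in FIG. 6 is the following Python sketch. The class name `IntervalRecord` and all field names are hypothetical; only the set of recorded items (time, presence estimate, overall determination result, threshold determination result, rank) comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IntervalRecord:
    """One row of the table in FIG. 6 (field names are hypothetical)."""
    time_s: float                 # elapsed time from the start of content C
    presence: float               # presence estimate, e.g. in the range 0..1
    threshold_flag: bool = False  # threshold determination result
    rank: Optional[int] = None    # rank of the run this interval belongs to
    overall_flag: bool = False    # overall determination result


# The estimation side fills in time and estimate; the determination side
# fills in the remaining fields later.
row = IntervalRecord(time_s=0.5, presence=0.82)
row.threshold_flag = row.presence >= 0.7
```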

The presence estimate / determination result storage means 30 can be implemented with a general recording medium such as a semiconductor memory. It may also be configured as two separate means: a presence estimate storage means that stores the presence estimate for each time interval, and a determination result storage means that stores the determination result for each time interval.

The setting information storage means 40 stores the various information set by the setting device 3 and can be implemented with a general recording medium such as a semiconductor memory. Here, the setting device 3 writes a threshold, extraction conditions, and a margin time to the setting information storage means 40.

The threshold is the value used as the criterion for determining whether the sense of presence is high. Here, the setting device 3 accepts input of the threshold via a setting screen displayed on the display device 2 and sets it in the scene extraction device 1.
The threshold may be an absolute threshold, which is a fixed value, or a relative threshold, which varies with the time length within the content C.
For example, when the threshold is an absolute threshold, the setting device 3 sets the threshold within the range of the presence estimate (for example, 0 to 1). That is, the setting device 3 sets, as the threshold, a value (for example, 0.8) specified by the viewer via the setting screen displayed on the display device 2.

When the threshold is a relative threshold, the setting device 3 sets the time length or the ratio of the entire content C to be extracted as scenes, taken in descending order of presence estimate. That is, the setting device 3 sets a time length (for example, 5 minutes) or a ratio (for example, 80%) specified by the viewer via the setting screen displayed on the display device 2 as the information for calculating the threshold. The determination means 50, described later, uses this time length or ratio when calculating the relative threshold.

The extraction conditions are conditions for extracting scenes. They include, for example, a minimum scene extraction time, a maximum number of extracted scenes, and a display order.
The minimum scene extraction time is the shortest duration for which the sense of presence must persist for a scene to be treated as a high-presence scene. In other words, it is a time limit that prevents a scene from being extracted, even if it has been determined to have a high sense of presence, when its duration is too short.
The maximum number of extracted scenes is the upper limit on the number of scenes extracted when the content C contains many high-presence scenes.
The display order is the order in which the extracted high-presence scenes are displayed on the display device 2, for example the order of playback time in the content C or descending order of presence estimate.

Here, the setting device 3 accepts, via the setting screen displayed on the display device 2, identification information that identifies the minimum scene extraction time, the maximum number of extracted scenes, the display order, and so on, together with the specific value of each item (extraction time, number of extractions, etc.), and sets them in the scene extraction device 1. For the display order, a unique value is assigned in advance to each option, such as playback-time order or descending order of presence estimate.

The margin time is extra time added before and after a scene extracted as a high-presence scene. It allows extraction to include the lingering moments before and after the scene rather than strictly the high-presence portion alone. The same margin time may be set before and after the scene, or different times may be set. Here, the setting device 3 accepts input of the margin time via the setting screen displayed on the display device 2 and sets it in the scene extraction device 1.
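The settings described so far (threshold, extraction conditions, and margin time) could be grouped as in the following sketch. The class name `ExtractionSettings`, the field names, and the default values are hypothetical; only the set of items and the example values 0.8, 5 minutes, and 80% come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractionSettings:
    """Settings held by the setting information storage (names are hypothetical)."""
    absolute_threshold: Optional[float] = 0.8    # fixed threshold, e.g. 0.8
    relative_duration_s: Optional[float] = None  # e.g. 300 for "5 minutes"
    relative_ratio: Optional[float] = None       # e.g. 0.8 for "80%"
    min_scene_s: Optional[float] = None          # minimum scene extraction time
    max_scenes: Optional[int] = None             # maximum number of extracted scenes
    display_order: str = "playback"              # or "presence"
    margin_before_s: float = 0.0                 # margin added before a scene
    margin_after_s: float = 0.0                  # margin added after a scene

    def uses_relative_threshold(self) -> bool:
        """True when a time length or ratio has been set for a relative threshold."""
        return self.relative_duration_s is not None or self.relative_ratio is not None
```

Leaving `min_scene_s` and `max_scenes` as `None` mirrors the "not set" cases that the description handles separately.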

Based on the setting information stored in the setting information storage means 40, the determination means 50 determines whether to extract the scene of each time interval with which the presence estimation means 10 has associated a presence estimate.
Here, as shown in FIG. 6, the determination means 50 writes a threshold determination result, a rank, and an overall determination result to the presence estimate / determination result storage means 30 as the determination result.

The "threshold determination result" indicates whether the presence estimate of the time interval is equal to or greater than the set threshold. Specifically, the determination means 50 compares the threshold stored in the setting information storage means 40 with the presence estimate stored in the presence estimate / determination result storage means 30. If the presence estimate is equal to or greater than the threshold, it determines that the time interval has a high sense of presence and sets the flag. If the presence estimate is less than the threshold, it determines that the time interval does not have a high sense of presence and resets the flag. For example, FIG. 6 shows a state in which the threshold is set to "0.7" and the flag ("1") is set for the time intervals (t1, t2, t10) whose presence estimates are equal to or greater than the threshold.
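The threshold determination can be sketched in a few lines of Python. The function name `threshold_flags` and the sample estimates are made up for illustration; they are chosen only so that t1, t2, and t10 pass a threshold of 0.7, as in the FIG. 6 example.

```python
def threshold_flags(estimates, threshold):
    """Return 1 for intervals whose estimate is >= threshold, else 0."""
    return [1 if value >= threshold else 0 for value in estimates]


# Hypothetical estimates for intervals t1..t10 (not taken from FIG. 6).
estimates = [0.9, 0.8, 0.5, 0.4, 0.3, 0.6, 0.2, 0.1, 0.65, 0.75]
flags = threshold_flags(estimates, 0.7)
```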

When a relative threshold is used, the determination means 50 takes as the threshold the presence estimate at which, counting down from the highest presence estimate in the entire content C, the time length or ratio set in the setting information storage means 40 is reached. That is, working down from the highest of the presence estimates stored in the presence estimate / determination result storage means 30, the determination means 50 takes as the threshold the presence estimate at which the set time length (for example, 5 minutes) or ratio (for example, 80%) is reached.
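One possible reading of the relative threshold is sketched below: sort the estimates in descending order and take the value at the position where the requested time length or ratio is filled. The function name `relative_threshold`, the rounding up to whole intervals, and the fixed interval length `interval_s` are assumptions of this sketch, not details given in the source.

```python
import math


def relative_threshold(estimates, interval_s, duration_s=None, ratio=None):
    """Return the estimate at which the top intervals fill the requested share.

    Exactly one of duration_s (seconds of content to keep) or ratio
    (fraction of all intervals to keep) is expected.
    """
    if (duration_s is None) == (ratio is None):
        raise ValueError("give exactly one of duration_s or ratio")
    if not estimates:
        raise ValueError("no estimates")
    if duration_s is not None:
        count = math.ceil(duration_s / interval_s)
    else:
        count = math.ceil(ratio * len(estimates))
    count = max(1, min(count, len(estimates)))  # clamp to a valid position
    return sorted(estimates, reverse=True)[count - 1]
```

If several intervals share the threshold value, a later "greater than or equal" comparison will flag slightly more than the requested share; how ties are handled is not specified in the text.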

The "rank" is a number assigned to each run of consecutive time intervals determined in the "threshold determination result" to have a high sense of presence (that is, with the flag set), in descending order of the highest presence estimate within the run. Specifically, the determination means 50 searches the threshold determination results stored in the presence estimate / determination result storage means 30 for continuous time intervals, that is, runs of consecutive flagged intervals, that last at least the minimum scene extraction time stored in the setting information storage means 40. It then ranks these continuous time intervals in descending order of presence, based on the highest presence estimate within each.
If no minimum scene extraction time is set, the determination means 50 ranks every run of consecutive flagged time intervals. A predetermined default value may of course be provided as the minimum scene extraction time.

For example, FIG. 6 shows a case in which the time intervals (t1, t2) and the time interval (t10), in each of which presence estimates equal to or greater than the threshold continue for at least the minimum scene extraction time, are assigned rank "1" and rank "10", respectively, in descending order of presence estimate. The intermediate ranks are omitted from the figure.
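The run search and ranking can be illustrated with the following sketch, which assumes fixed-length intervals of `interval_s` seconds. The function name `ranked_runs` and the tuple layout of the result are hypothetical.

```python
def ranked_runs(estimates, flags, interval_s, min_scene_s=0.0):
    """Find runs of flagged intervals and rank them by their peak estimate.

    Returns a list of (rank, start_index, end_index, peak), rank 1 first;
    end_index is inclusive. Runs shorter than min_scene_s are skipped.
    """
    runs = []
    start = None
    for i, flag in enumerate(list(flags) + [0]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            length_s = (i - start) * interval_s
            if length_s >= min_scene_s:
                runs.append((start, i - 1, max(estimates[start:i])))
            start = None
    runs.sort(key=lambda run: run[2], reverse=True)  # highest peak first
    return [(rank, s, e, peak) for rank, (s, e, peak) in enumerate(runs, start=1)]
```

With `min_scene_s` left at 0.0 every run is ranked, which corresponds to the case where no minimum scene extraction time is set.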

The "overall determination result" indicates whether the scene of a time interval ranked by "rank" falls within the set maximum number of extracted scenes. Specifically, the determination means 50 takes the ranked scenes in rank order, up to the maximum number of extracted scenes stored in the setting information storage means 40, as the scenes to be extracted, sets the flags of the time intervals corresponding to those scenes, and resets the flags of the other time intervals.
If no maximum number of extracted scenes is set, the determination means 50 places no limit on the number of extractions. A predetermined default value may of course be provided as the maximum number of extracted scenes.
For example, if the maximum number of extracted scenes is "5" in the example of FIG. 6, the overall determination result flag is set for the time intervals (t1, t2) of rank "1" and is not set for the time interval (t10) of rank "10".
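The effect of the maximum number of extracted scenes in this example can be reproduced with a short sketch. The function name `overall_flags` and the tuple layout of `ranked` are hypothetical.

```python
def overall_flags(n_intervals, ranked, max_scenes=None):
    """Set the overall flag on intervals of the top-ranked runs only.

    ranked: tuples starting with (rank, start_index, end_index), where
    end_index is inclusive; any further elements are ignored.
    max_scenes=None means no limit, matching the unset case in the text.
    """
    flags = [0] * n_intervals
    for rank, start, end, *_ in ranked:
        if max_scenes is not None and rank > max_scenes:
            continue  # outside the allowed number of scenes
        for i in range(start, end + 1):
            flags[i] = 1
    return flags
```

For instance, with runs ranked 1 (t1, t2) and 10 (t10) and `max_scenes=5`, only the first two intervals stay flagged.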

The overall determination result allows the extraction means 60 to identify the scenes to extract. If a margin time is set as setting information in the setting information storage means 40, the determination means 50 also sets the flag for the time intervals covering the margin time before and after each scene whose overall determination result flag is set. As a result, each scene extracted by the extraction means 60 is extended by the margin time before and after it.
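A minimal sketch of the margin expansion follows, assuming fixed-length intervals and margins rounded to whole intervals; both assumptions, and the function name `add_margins`, are illustrative and not from the source.

```python
def add_margins(flags, interval_s, before_s=0.0, after_s=0.0):
    """Extend every flagged interval by the margin time on each side."""
    n_before = round(before_s / interval_s)
    n_after = round(after_s / interval_s)
    out = list(flags)
    for i, flag in enumerate(flags):  # read the original flags, write to the copy
        if flag:
            lo = max(0, i - n_before)
            hi = min(len(flags) - 1, i + n_after)
            for j in range(lo, hi + 1):
                out[j] = 1
    return out
```

Note that two scenes separated by less than the combined margins will merge into one flagged run under this sketch; the text does not say how such cases are treated.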

If a display order is set as setting information in the setting information storage means 40, the determination means 50 attaches a display order number, according to the set display order, to each scene whose overall determination result flag is set in the presence estimate / determination result storage means 30 (not shown). If no display order is set, the determination means 50 attaches display order numbers in a predetermined default order, for example descending order of presence estimate. The display order may also be stored by overwriting the "order" field (see FIG. 6) of the presence estimate / determination result storage means 30.

The extraction means 60 extracts high-presence scenes from the content C. It extracts the scenes of the content C that are recorded in the presence estimate / determination result storage means 30 as having been determined to be high-presence scenes.
Specifically, for each scene whose flag is set in the determination result (specifically, the overall determination result) stored in the presence estimate / determination result storage means 30, the extraction means 60 reads from the content C the section running from the start time of the scene's first time interval to the end time of its last time interval, plays it back, and outputs it to the display device 2, proceeding through the scenes in the set display order.
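The mapping from overall flags to playback sections could be sketched as follows. The function name `scene_segments`, the order labels "playback" and "presence", and the fixed interval length are assumptions of this example.

```python
def scene_segments(flags, estimates, interval_s, order="playback"):
    """Turn overall flags into (start_s, end_s) segments in display order.

    order: "playback" keeps content order; "presence" sorts by peak estimate.
    """
    segments = []
    start = None
    for i, flag in enumerate(list(flags) + [0]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            peak = max(estimates[start:i])
            # Start of the first interval to the end of the last interval.
            segments.append((start * interval_s, i * interval_s, peak))
            start = None
    if order == "presence":
        segments.sort(key=lambda seg: seg[2], reverse=True)
    return [(s, e) for s, e, _ in segments]
```

Each returned pair gives the start time of a scene's first interval and the end time of its last interval, which is the section the extraction step would read and play back.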

Because the determination means 50 sets flags only for continuous time intervals that last at least the minimum scene extraction time, the extraction means 60 extracts only scenes of at least that duration. Likewise, because the determination means 50 sets flags for continuous time intervals only up to the maximum number of extracted scenes, the extraction means 60 extracts a limited number of scenes.
If a sensory signal such as a vibration signal is attached to the content C, the extraction means 60 may output the sensory signal corresponding to the extracted time to a reproduction device (for example, a vibration reproduction device), which is not shown.
In this way, the extraction means 60 can extract high-presence scenes from the content C according to the extraction method set by the viewer.

With the scene extraction device 1 configured as described above, high-presence scenes can be extracted based on the viewer's subjective perception rather than on the objective feature quantities of the content C.

〔Operation of the scene extraction device〕
Next, the operation of the scene extraction device 1 according to the first embodiment of the present invention will be described with reference to FIGS. 7 to 10 (and to FIG. 1 as appropriate for the configuration). It is assumed that the presence learning data storage means 20 already stores learning data (presence learning data) in which the sense of presence associated with the feature quantities of the acoustic signal, the video signal, and the sensory signal has been learned in advance. Various setting information is also set in advance in the setting information storage means 40 by the setting device 3. This setting operation is described with reference to FIG. 7.

As shown in FIG. 7, the setting device 3 sets in the scene extraction device 1 the threshold used as the criterion for determining whether the sense of presence is high (step S100). That is, in step S100, the setting device 3 displays a setting screen on the display device 2 and accepts the threshold set by the viewer. The threshold is stored in the setting information storage means 40.
When a relative threshold is used, the setting device 3 accepts from the viewer the time length or ratio of the entire content C to be extracted as scenes in descending order of presence estimate, and stores it in the setting information storage means 40.

The setting device 3 also sets in the scene extraction device 1 the extraction conditions for the scenes to be displayed on the display device 2 as high-presence scenes (step S101). That is, in step S101, the setting device 3 displays a setting screen on the display device 2 and accepts the extraction conditions set by the viewer. The extraction conditions include, for example, the minimum scene extraction time, the maximum number of extracted scenes, and the display order. The extraction conditions are stored in the setting information storage means 40.

The setting device 3 also sets in the scene extraction device 1 the margin time to be added before and after each scene extracted as a high-presence scene (step S102). That is, in step S102, the setting device 3 displays a setting screen on the display device 2 and accepts the margin time setting. The margin time is stored in the setting information storage means 40.
Steps S100 to S102 described above may be performed in any order.

Next, the overall operation of the scene extraction device 1 will be described with reference to FIG. 8 (and to FIG. 1 as appropriate).
The scene extraction device 1 first reads the various setting information from the setting information storage means 40 (step S1).
After the settings have been read in step S1, the presence estimation means 10 of the scene extraction device 1 estimates the sense of presence over a predetermined time width at each predetermined time interval of the content C (step S2).
The operation of step S2 is described in more detail with reference to FIG. 9.

As shown in FIG. 9, the presence estimation means 10 uses the auditory presence estimation means 11, the visual presence estimation means 12, and the sensory presence estimation means 13 to calculate a presence estimate for each of the plurality of signals (video signal, acoustic signal, and sensory signal) at a predetermined time interval and over a predetermined time width (step S20).
That is, the auditory presence estimation means 11 calculates an auditory presence estimate from the acoustic signal of the content C based on the presence learning data stored in the presence learning data storage means 20. The visual presence estimation means 12 calculates a visual presence estimate from the video signal of the content C based on the presence learning data. The sensory presence estimation means 13 calculates a sensory presence estimate from the sensory signal of the content C based on the presence learning data.

The presence specifying means 14 of the presence estimation means 10 then specifies the maximum of the plurality of presence estimates calculated in step S20 (the auditory, visual, and sensory presence estimates) as the presence estimate (representative value) for that time interval (step S21). In step S21, a statistic such as the mean or a weighted sum may be specified as the presence estimate for the time interval instead of the maximum.

The presence specifying means 14 then writes the presence estimate (representative value) specified in step S21 for each time interval to the presence estimate / determination result storage means 30 in association with that time interval (step S22).

The presence estimation means 10 of the scene extraction device 1 then determines whether the content C has been read to its end (step S23). If data of the content C remains (No in step S23), the scene extraction device 1 returns to step S20 and estimates the sense of presence for the next time interval.
If reading of the data of the content C is complete (Yes in step S23), the scene extraction device 1 ends the presence estimation operation.
Through steps S20 to S23, the presence estimates are stored in time series in the presence estimate / determination result storage means 30.
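The loop formed by steps S20 to S23 can be sketched as below. The estimator callables stand in for the learned models and are hypothetical, as are the function name `estimate_presence` and the dictionary keys; the use of the maximum as the representative value follows step S21.

```python
def estimate_presence(windows, estimators):
    """Run steps S20 to S23 over a sequence of time windows.

    windows: iterable of (time_s, features), where features maps a signal
             name ("audio", "video", "sensory") to its feature vector.
    estimators: maps the same names to callables returning a value in 0..1;
                these stand in for the learned models and are hypothetical.
    Returns a time-ordered list of (time_s, representative_estimate).
    """
    stored = []
    for time_s, features in windows:              # S23: repeat until content ends
        individual = [                            # S20: per-signal estimates
            estimators[name](vector)
            for name, vector in features.items()
            if name in estimators
        ]
        if not individual:
            continue                              # no usable signal in this window
        representative = max(individual)          # S21: maximum as representative
        stored.append((time_s, representative))   # S22: store against the time
    return stored
```

Replacing `max` with a mean or a weighted sum gives the alternatives mentioned for step S21.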
Returning to FIG. 8, the description of the overall operation of the scene extraction device 1 continues.

After the sense of presence has been estimated in step S2, the determination means 50 of the scene extraction device 1 determines, for each time interval, whether the scene gives the viewer a sense of presence, and records the result (step S3).
The operation of step S3 is described in more detail with reference to FIG. 10.

As shown in FIG. 10, the determination means 50 sets the threshold determination result flag (see FIG. 6) for each time interval whose presence estimate, stored in the presence estimate / determination result storage means 30, is equal to or greater than the threshold stored in the setting information storage means 40 (step S30). For time intervals whose presence estimate is less than the threshold, the flag is reset.
When a relative threshold is used, the determination means 50 takes as the threshold the presence estimate at which, counting down from the highest presence estimate in the entire content C, the time length or ratio set in the setting information storage means 40 is reached.

The determination means 50 then searches the time intervals flagged in step S30, as stored in the presence estimate / determination result storage means 30, for continuous time intervals that last at least the minimum scene extraction time stored in the setting information storage means 40 (step S31).
The determination means 50 then ranks the continuous time intervals found in step S31 in descending order of the maximum presence estimate within each continuous time interval (see FIG. 6) (step S32).

そして、判定手段５０は、ステップＳ３２で順位付けされた連続時間区間において、設定情報記憶手段４０に記憶されている最大シーン抽出数のシーン（連続時間区間）に対応する各時間区間に総合判定結果（図６参照）のフラグをセットする（ステップＳ３３）。
さらに、判定手段５０は、設定情報記憶手段４０に設定情報として糊代時間が設定されている場合、総合判定結果のフラグがセットされたシーンの前後の糊代時間分の時間区間に総合判定結果のフラグをセットする（ステップＳ３４）。
このステップＳ３０からステップＳ３４までの動作によって、コンテンツＣ中、時間区間単位で臨場感が高いか否かの判定結果が臨場感推定値／判定結果記憶手段３０に記録されることになる。
図８に戻って、シーン抽出装置１の全体動作について説明を続ける。 Then, the determination unit 50 performs a comprehensive determination result for each time interval corresponding to the maximum number of scene extraction scenes (continuous time intervals) stored in the setting information storage unit 40 in the continuous time intervals ranked in step S32. The flag (see FIG. 6) is set (step S33).
Further, when the margin time is set as the setting information in the setting information storage unit 40, the determination unit 50 determines the total determination result in the time interval corresponding to the margin time before and after the scene in which the flag of the total determination result is set. Is set (step S34).
Through the operations from step S30 to step S34, the determination result as to whether or not the presence is high in the time interval in the content C is recorded in the presence estimation value / determination result storage means 30.
Returning to FIG. 8, the description of the overall operation of the scene extraction device 1 continues.

After the presence determination for each time interval in step S3, the scene extraction device 1 uses the extraction means 60 to extract scenes with high presence from the content C (step S4).
Specifically, the extraction means 60 refers to the determination results (specifically, the overall determination results) in the presence estimate / determination result storage means 30, extracts from the content C the scenes in the time intervals (continuous time intervals) determined to have high presence (those whose flag is set), and outputs them to the display device 2.
Through the operations described above, the scene extraction device 1 can extract scenes with high presence based on the viewer's subjective impression rather than on objective feature amounts of the content C.

≪Second Embodiment≫
Next, the configuration of a scene extraction device 1B according to a second embodiment of the present invention will be described with reference to FIG. 11.

The scene extraction device 1B extracts scenes with high presence from the content C and outputs the extracted scenes to the display device 2.
The scene extraction device 1 described with reference to FIG. 1 specifies (calculates) a single presence estimate from a plurality of presence estimates and holds one presence estimate per time interval. The scene extraction device 1B, in contrast, holds a plurality of presence estimates per time interval and accepts externally set weights, so that when scenes with high presence are extracted, it is possible to choose which kind of presence the extracted scenes should emphasize.

The scene extraction device 1B includes presence estimation means 10B, presence learning data storage means 20, presence estimate / determination result storage means 30B, setting information storage means 40B, determination means 50B, and extraction means 60. The presence learning data storage means 20 and the extraction means 60 have the same configuration as in the scene extraction device 1 (see FIG. 1), so their description is omitted.

The presence estimation means 10B calculates, from the content C, a presence estimate indicating the degree of presence for each unit of time. Here, the presence estimation means 10B is the presence estimation means 10 (see FIG. 1) with the presence specifying means 14 removed.
In the presence estimation means 10B, the auditory presence estimation means 11, the visual presence estimation means 12, and the sensory presence estimation means 13 each store the individual presence estimate they have estimated (auditory presence estimate, visual presence estimate, sensory presence estimate) in the presence estimate / determination result storage means 30B.

The presence estimate / determination result storage means 30B stores the individual presence estimates for each time interval of the content C, as estimated by the presence estimation means 10B, in association with the elapsed time from the beginning of the content C.
The presence estimate / determination result storage means 30B also has an area for recording the result (for example, a flag) of the determination, made by the determination means 50B, of whether the individual presence estimates in each time interval give the viewer a sense of presence.
For example, as shown in FIG. 12, the presence estimate / determination result storage means 30B has areas for recording the time (elapsed time from the beginning), the presence estimates, and the determination results (overall determination result, threshold determination result, rank). The time and the presence estimates are written by the presence estimation means 10B, and the determination results are written by the determination means 50B. The auditory presence estimate, the visual presence estimate, and the sensory presence estimate are stored individually as the presence estimates.
The presence estimate / determination result storage means 30B can be configured with a general recording medium such as a semiconductor memory.
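The record layout described for FIG. 12 could be modelled roughly as below. This is a sketch under assumptions: the class name `IntervalRecord`, the field names, the helper `build_table`, and the fixed step `step_sec` between time intervals are hypothetical and are not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class IntervalRecord:
    # Written by the presence estimation side.
    elapsed_sec: float   # elapsed time from the beginning of the content
    auditory: float      # auditory presence estimate
    visual: float        # visual presence estimate
    sensory: float       # sensory presence estimate
    # Written later by the determination side.
    threshold_flag: bool = False  # threshold determination result
    rank: Optional[int] = None    # rank of the scene, if any
    overall_flag: bool = False    # overall determination result


def build_table(step_sec: float,
                auditory: Sequence[float],
                visual: Sequence[float],
                sensory: Sequence[float]) -> List[IntervalRecord]:
    """Create one record per time interval with empty determination fields."""
    return [
        IntervalRecord(i * step_sec, a, v, s)
        for i, (a, v, s) in enumerate(zip(auditory, visual, sensory))
    ]
```

Keeping the three individual estimates in every record, rather than a single combined value, is what allows the weighting to be changed after estimation without analysing the content again.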

The setting information storage means 40B stores the various kinds of information set with the setting device 3 and can be configured with a general recording medium such as a semiconductor memory. Here, the setting device 3 writes a threshold, extraction conditions, a margin time, and weights to the setting information storage means 40B.

The weights are weighting coefficients for the content signals (acoustic signal, video signal, sensory signal) whose presence is estimated by the presence estimation means 10B. The weights are chosen so that their sum over these signals is "1". Here, the setting device 3 receives the weights through a setting screen displayed on the display device 2 and sets them in the scene extraction device 1.
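One way to obtain weights whose sum is 1 from arbitrary user input is to normalize them, as in the following Python sketch. The function name `normalize_weights`, the dictionary keys, and the rejection of negative values are assumptions for illustration; the text only requires that the weights sum to 1.

```python
from typing import Dict


def normalize_weights(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale raw weights (e.g. slider values) so that they sum to 1."""
    if any(value < 0 for value in raw.values()):
        # Assumption: negative weights are treated as invalid input.
        raise ValueError("weights must be non-negative")
    total = sum(raw.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in raw.items()}


# Example: the auditory signal counts twice as much as the other two.
weights = normalize_weights({"auditory": 2, "visual": 1, "sensory": 1})
```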

Based on the setting information stored in the setting information storage means 40B, the determination means 50B determines whether to extract the scene of each time interval with which the presence estimation means 10B has associated presence estimates. The function of the determination means 50B is basically the same as that of the determination means 50 (see FIG. 1). The difference is that, to determine presence for each time interval, the determination means 50B compares the threshold with the value obtained by weighted addition of the individual presence estimates (auditory presence estimate, visual presence estimate, sensory presence estimate) using the weights stored in the setting information storage means 40B.

That is, the determination means 50B takes the individual presence estimates for each time interval stored in the presence estimate / determination result storage means 30B (auditory presence estimate, visual presence estimate, sensory presence estimate), performs weighted addition with the weights set with the setting device 3, and uses the result as the presence estimate of that time interval. Treating this weighted sum as the presence estimate, the determination means 50B applies the same determination method as the determination means 50 (see FIG. 1): it sets the threshold determination result flag for time intervals determined to have high presence and resets the flag for time intervals determined not to have high presence (see FIG. 12).
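The weighted addition and threshold comparison can be sketched in Python as follows. The function name `threshold_flags`, the tuple layout of the estimates, the dictionary keys, and the choice of "greater than or equal to" for the comparison are assumptions made for this sketch.

```python
from typing import Dict, List, Sequence, Tuple


def threshold_flags(estimates: Sequence[Tuple[float, float, float]],
                    weights: Dict[str, float],
                    threshold: float) -> List[bool]:
    """Return one threshold determination flag per time interval.

    estimates: (auditory, visual, sensory) individual estimates per interval.
    weights:   weights for "auditory", "visual", "sensory", summing to 1.
    """
    flags = []
    for auditory, visual, sensory in estimates:
        combined = (weights["auditory"] * auditory
                    + weights["visual"] * visual
                    + weights["sensory"] * sensory)
        # Set the flag when the weighted sum reaches the threshold,
        # reset it otherwise.
        flags.append(combined >= threshold)
    return flags
```

Because the individual estimates are stored unchanged, calling the function again with different weights re-evaluates the same content without repeating the estimation step.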

With the scene extraction device 1B configured as described above, the scene extraction device 1B can extract scenes with high presence based on the viewer's subjective impression rather than on objective feature amounts of the content C. Furthermore, because the weights of the plural kinds of presence can be changed after presence has been estimated, the viewer can extract desired scenes from the content C by changing the weighting as appropriate.

Embodiments of the present invention have been described above. The scene extraction devices 1 and 1B can also be operated by a scene extraction program that causes a computer to function as each of the means described above.
The present invention is not limited to these embodiments. Modifications of the present invention are described below.

≪Modification 1≫
In the scene extraction devices 1 and 1B, the setting device 3 sets a single threshold as the criterion for determining whether presence is high. However, a plurality of thresholds may be set, one for each level of presence. For example, thresholds are set for levels such as "presence is very high" and "presence is somewhat high".

In this case, when the determination means 50 and 50B record the presence determination result in the presence estimate / determination result storage means 30, they store a determination result (overall determination result, threshold determination result) that reflects the level of presence. For example, the determination result is "2" for a time interval whose presence estimate is at or above the threshold for "presence is very high", and "1" for a time interval whose presence estimate is below that threshold but at or above the threshold for "presence is somewhat high".
Then, when a level is specified with the setting device 3, the extraction means 60 extracts the scenes corresponding to the time intervals for which that level is set.
In this way, the viewer can categorize the degree of presence and view scenes that differ in presence.
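The two-threshold example of Modification 1 can be sketched in Python as follows. The function names, the parameter names, and the use of level 0 for time intervals below both thresholds are assumptions; the text defines only the results "2" and "1".

```python
from typing import List, Sequence


def presence_level(estimate: float, very_high: float,
                   somewhat_high: float) -> int:
    """Map a presence estimate to a level using two thresholds."""
    if estimate >= very_high:
        return 2  # "presence is very high"
    if estimate >= somewhat_high:
        return 1  # "presence is somewhat high"
    return 0      # assumption: below both thresholds


def intervals_at_level(estimates: Sequence[float], level: int,
                       very_high: float, somewhat_high: float) -> List[int]:
    """Indices of the time intervals whose level equals the requested one."""
    return [i for i, e in enumerate(estimates)
            if presence_level(e, very_high, somewhat_high) == level]
```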

≪Modification 2≫
In the scene extraction devices 1 and 1B, the extraction means 60 displays the extracted scenes on the display device 2. However, the extraction means 60 may first display the first frame of each scene to be extracted on the display device 2 as a thumbnail image, and then extract from the content C and display only the scenes selected by the viewer. The thumbnail images may be displayed in chronological order or in descending order of presence estimate.
In this way, the viewer can select and view scenes from among the scenes with high presence.
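The two display orders mentioned for the thumbnails can be sketched in Python as follows. The function name `thumbnail_order`, the dictionary keys `start_sec` and `max_estimate`, and the `by` parameter are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Sequence


def thumbnail_order(scenes: Sequence[Dict[str, float]],
                    by: str = "time") -> List[Dict[str, float]]:
    """Order candidate scenes for thumbnail display.

    by="time":     chronological order of the scene start.
    by="presence": descending order of the scene's presence estimate.
    """
    if by == "time":
        return sorted(scenes, key=lambda s: s["start_sec"])
    if by == "presence":
        return sorted(scenes, key=lambda s: s["max_estimate"], reverse=True)
    raise ValueError(f"unknown ordering: {by}")
```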

≪Modification 3≫
In the scene extraction devices 1 and 1B, the auditory presence estimation means 11 calculates a presence estimate from the acoustic signal of the content C. However, the auditory presence estimation means 11 may determine from the acoustic feature amounts whether the acoustic signal is mainly human speech and, depending on the setting, either raise the auditory presence estimate when the degree of human speech is high or lower it when the degree of human speech is low.
In this way, the scene extraction devices 1 and 1B can distinguish, when extracting scenes, between scenes in which presence is high because of interpersonal activity and scenes in which presence is high because of the environment other than interpersonal activity.

≪Modification 4≫
In the scene extraction devices 1 and 1B, the presence estimation means 10B stores the presence estimates in the presence estimate / determination result storage means 30.
However, the presence estimation means 10B may estimate degrees of emotion in addition to presence. In this case, the presence learning data storage means 20 is trained in advance, for each of the acoustic signal, the video signal, and the sensory signal, on the degree to which a scene is exciting, the degree to which a scene is moving, the degree to which a scene is comfortable, the degree to which a scene is active, and so on. Each of the estimation means 11, 12, and 13 of the presence estimation means 10B can then estimate a degree of emotion (emotion estimate) in addition to presence.
Then, when the setting device 3 sets weights for the emotion estimates in addition to the presence weights, the determination means 50B performs weighted addition of the presence estimates and the emotion estimates.
In this way, the scene extraction device 1B can extract scenes in which emotion runs high in addition to presence being high.
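The weighted addition of presence and emotion estimates in Modification 4 can be sketched in Python with a score keyed by name. The function name `combined_score` and the keys `presence` and `excitement` are hypothetical; the text does not specify how the estimates are labelled or stored.

```python
from typing import Dict


def combined_score(estimates: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum over every estimate that has a weight.

    Both presence estimates and emotion estimates are looked up by name,
    so adding an emotion only requires adding a key and a weight.
    A weight without a matching estimate raises KeyError.
    """
    return sum(weight * estimates[name] for name, weight in weights.items())


# Example: presence and one emotion estimate weighted equally.
score = combined_score(
    {"presence": 0.4, "excitement": 0.9},
    {"presence": 0.5, "excitement": 0.5},
)
```

The resulting score would then be compared with the threshold in the same way as a presence estimate.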

≪Modification 5≫
In the scene extraction devices 1 and 1B, the presence estimation means 10 and 10B estimate auditory presence, visual presence, and sensory presence individually.
However, instead of estimating presence individually, the presence estimation means 10 and 10B may estimate a single representative presence in one step. In this case, a model that estimates one presence value from the feature amounts of a plurality of signals (acoustic signal, video signal, and sensory signal) is trained in advance, using a neural network or the like, and held in the presence learning data storage means 20. The presence estimation means 10 and 10B then use the neural network to estimate, from the feature amounts of the plurality of signals (acoustic signal, video signal, and sensory signal) constituting the content C, a presence value that represents all of the signals.

DESCRIPTION OF SYMBOLS
1 Scene extraction device
10 Presence estimation means
11 Auditory presence estimation means
12 Visual presence estimation means
13 Sensory presence estimation means
14 Presence specifying means
20 Presence learning data storage means
30 Presence estimate / determination result storage means (presence estimate storage means, determination result storage means)
40 Setting information storage means
50 Determination means
60 Extraction means
2 Display device
3 Setting device

Claims

A scene extraction device that extracts scenes with high presence from content, the presence being defined by learning predetermined feature amounts of the content, the scene extraction device comprising:
presence learning data storage means for storing presence learning data in which the degree of presence with respect to the feature amounts of the signals constituting the content has been learned in advance;
setting information storage means for storing, as setting information, at least a time interval and a time width of the signals over which the presence is analyzed, and a threshold serving as a criterion for determining whether the presence is high;
presence estimation means for analyzing the feature amounts of the content for each time interval of the predetermined time width, the time intervals being shifted from one another by the predetermined time interval so as to include overlapping portions, and for calculating the degree of presence from the presence learning data as a presence estimate;
presence estimate storage means for storing the presence estimate in association with the time interval;
determination means for determining, based on the preset threshold, whether the scene in the time interval associated with the presence estimate has high presence;
determination result storage means for storing the determination result of the determination means; and
extraction means for extracting from the content the scenes corresponding to the time intervals that are stored in the determination result storage means as having been determined to have high presence.

The scene extraction device according to claim 1, wherein the presence estimation means calculates an individual presence estimate for each signal constituting the content and uses the maximum value, the average value, or a weighted sum of the individual presence estimates as the presence estimate.

The scene extraction device according to claim 1, wherein the presence estimation means calculates an individual presence estimate for each signal constituting the content and stores it in the presence estimate storage means, and
the determination means calculates, for each time interval, the maximum value, the average value, or a weighted sum of the plurality of individual presence estimates as the presence estimate and determines, based on the preset threshold, whether the presence is high.

The scene extraction device according to any one of claims 1 to 3, wherein a time length or a ratio of the scenes in the content that are to be determined to have high presence is stored in advance in the setting information storage means as the setting information, and
the determination means uses, as the threshold, the presence estimate at which the time length or the ratio is reached when the time intervals are taken in descending order of presence estimate.

The scene extraction device according to any one of claims 1 to 4, wherein the threshold comprises a plurality of values determined in advance according to the level of presence,
the determination means sets a presence level for each time interval by comparing the presence estimate of the time interval with the plurality of thresholds, and
the extraction means extracts from the content the scenes corresponding to the time intervals for which an externally specified level is set.

The scene extraction device according to any one of claims 1 to 5, wherein a shortest scene extraction time is stored in advance in the setting information storage means as the setting information, and
the extraction means extracts, as the scenes with high presence, only scenes that last at least the shortest scene extraction time.

The scene extraction device according to any one of claims 1 to 6, wherein a maximum scene extraction number is stored in advance in the setting information storage means as the setting information, and
the extraction means extracts the scenes with high presence with the maximum scene extraction number as an upper limit.

The scene extraction device according to any one of claims 1 to 7, wherein a margin time is stored in advance in the setting information storage means as the setting information, and
the extraction means extracts from the content the scenes corresponding to time intervals obtained by adding the margin time to the time intervals determined to have high presence.

The scene extraction device according to claim 1, wherein the presence estimation means calculates an emotion estimate for each of the time intervals from the content, using learning data in which the feature amounts of the content and the degree of emotion have been learned in advance, and
the determination means determines whether the presence is high using a weighted sum of the presence estimate and the emotion estimate.

A scene extraction program for causing a computer to function as each unit of the scene extraction device according to any one of claims 1 to 9.