JP2006237791A

JP2006237791A - Information processing apparatus and program

Info

Publication number: JP2006237791A
Application number: JP2005046716A
Authority: JP
Inventors: Hisashi Koseki; 悠小関; Yasuyuki Sumi; 康之角; Toyoaki Nishida; 豊明西田; Kenji Mase; 健二間瀬
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2005-02-23
Filing date: 2005-02-23
Publication date: 2006-09-07

Abstract

[Problem] A conventional information processing apparatus cannot appropriately extract predetermined still images from a moving image and output them.
[Solution] The information processing apparatus comprises four parts:
a content storage unit 101 that stores one or more contents, each including video information having video and audio information having audio;
an utterance location detection unit 102 that detects, based on the audio, an utterance location, that is, a portion in which someone is speaking;
a still image extraction unit 104 that extracts from the video one or more still images constituting the utterance location detected by the utterance location detection unit 102; and
an output unit 108 that outputs the one or more still images extracted by the still image extraction unit 104 to one or more substantially non-overlapping windows.
With this apparatus, predetermined still images can be appropriately extracted from a moving image and output.
[Selected Figure] FIG. 1

Description

The present invention relates to an information processing apparatus and the like that extract predetermined still images from a moving image and output them.

Among conventional methods for summarizing moving images, there is a method that can automatically extract semantically important events in a moving image (see, for example, Patent Document 1). This method includes four components:
a feature extractor 40 that extracts features of the video;
a probabilistic model 42 that uses a model such as a hidden Markov model to integrate the features and determine boundaries;
a commercial/non-commercial filter 44 that distinguishes commercial from non-commercial slow-motion replay segments; and
a summary generator 46 that generates a summary based on the detected slow-motion replay segments.
The feature extractor 40 extracts features from a color histogram in block 50 and extracts three features from pixel-based differences in block 52. The features extracted in block 52 characterize the slow-motion, still-field, and/or normal-speed playback components of a replay segment. The features extracted in block 50 characterize editing-effect components.
Non-Patent Document 1 discloses a method for acquiring content containing video and audio, such as the content described in this embodiment.
Patent Document 1: JP 2002-232840 A (page 1, FIG. 1, etc.)
Non-Patent Document 1: Yasuyuki Sumi and nine others, “Recording and Sharing Experiences in a Ubiquitous Environment,” Systems, Control and Information (Journal of the Institute of Systems, Control and Information Engineers), Vol. 48, No. 1, pp. 458-463, November 2004.

However, an information processing apparatus that implements the conventional summarization method has a problem: it cannot appropriately cut still images out of a moving image and output them effectively.

The information processing apparatus according to the first aspect of the invention comprises four parts:
a content storage unit that stores one or more contents, each including video information having video and audio information having audio;
an utterance location detection unit that detects, based on the audio, an utterance location, that is, a portion in which someone is speaking;
a still image extraction unit that extracts from the video one or more still images constituting the utterance location detected by the utterance location detection unit; and
an output unit that outputs the one or more still images extracted by the still image extraction unit to one or more substantially non-overlapping windows.

With this configuration, the apparatus automatically extracts still images that make up noteworthy scenes, such as conversations or remarks, and arranges them in many windows. This makes it possible, for example, to produce an electronic album summarizing the activities of someone who took part in an event.

The information processing apparatus according to the second aspect of the invention comprises four parts:
a content storage unit that stores one or more contents, each including video information having video and audio information having audio;
a facing location detection unit that detects that two or more objects shown in the video are facing each other;
a still image extraction unit that extracts one or more still images from the video constituting the facing location detected by the facing location detection unit; and
an output unit that outputs the one or more still images extracted by the still image extraction unit to one or more substantially non-overlapping windows.

With this configuration, the apparatus automatically extracts still images that make up noteworthy scenes, such as scenes in which the protagonist is facing and looking at an object, and arranges them in many windows. This makes it possible, for example, to produce an electronic album summarizing the activities of someone who took part in an event.

In the information processing apparatus according to the third aspect of the invention, which builds on the first aspect, the still image extraction unit repeatedly extracts from the video one or more still images constituting an utterance location and/or a facing location. The output unit outputs the extracted still images to the one or more windows, switching among them.
With this configuration, when many scenes are extracted, a series of activities can be overviewed effectively even on a small display.
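The switching behavior of the third aspect can be pictured as paging a long list of extracted stills through a fixed set of windows. Below is a minimal sketch under stated assumptions. The round-robin schedule, the discrete “tick”, and the function name `frames_at_tick` are all hypothetical; the source does not say how switching is scheduled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def frames_at_tick(images: Sequence[T], n_windows: int, tick: int) -> List[T]:
    """Return the still image to show in each of n_windows at a given tick.

    Assumed schedule: each tick advances by one "page" of n_windows images
    and wraps around to the start, so every extracted still is shown
    eventually.
    """
    if not images or n_windows <= 0:
        return []
    start = (tick * n_windows) % len(images)
    return [images[(start + w) % len(images)] for w in range(n_windows)]


stills = ["s1", "s2", "s3", "s4", "s5"]
print(frames_at_tick(stills, 2, 2))  # ['s5', 's1']
```

With two windows and five stills, the windows cycle through the pages (s1, s2), (s3, s4), (s5, s1), and so on.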

In the information processing apparatus according to the fourth aspect of the invention, which builds on any of the first to third aspects, the output unit includes two means:
a still image output means that outputs the still images extracted by the still image extraction unit; and
a balloon output means that outputs a speech balloon superimposed on the still image.
With this configuration, the extracted still images can be browsed like a comic.

In the information processing apparatus according to the fifth aspect of the invention, which builds on the fourth aspect, the balloon output means analyzes the audio corresponding to the video containing the still image. According to the analysis result, it outputs balloons of two or more distinguishable shapes.
With this configuration, the kind of scene can be grasped easily just by looking at the balloon.

In the information processing apparatus according to the sixth aspect of the invention, which builds on the third or fourth aspect, the balloon output means analyzes the audio corresponding to the video containing the still image and obtains the length of the utterance. It then outputs inside the balloon a character string whose length corresponds to that utterance length.
With this configuration, the length of a conversation or utterance can be grasped intuitively just by looking at the character string in the balloon.

The information processing apparatus according to the seventh aspect of the invention builds on any of the fourth to sixth aspects. It further comprises a position detection unit that analyzes the video and detects a position at least in the vicinity of an object. The balloon output means outputs the balloon near the position detected by the position detection unit.
With this configuration, a balloon can be displayed as if the person shown in the still image were speaking.

The information processing apparatus according to the eighth aspect of the invention builds on any of the fourth to seventh aspects. It further comprises a non-display user utterance detection unit. This unit analyzes the video and the audio and detects when the audio corresponding to a still image to be output is the voice of a user who does not appear in that still image. When the unit detects this, the balloon output means outputs the balloon so that it appears to come from outside or from a corner of the window.
With this configuration, a viewer can understand intuitively that someone not shown in the still image is speaking.

The information processing apparatus according to the ninth aspect of the invention builds on any of the first to sixth and eighth aspects. It further comprises a position detection unit that analyzes the video and detects a position at least in the vicinity of an object. The output unit outputs the still image using one output mode for the region around the detected position and a different mode for the other regions.
With this configuration, an object of interest can be displayed in a way that draws attention to it.
In the information processing apparatus according to the tenth aspect of the invention, which builds on the ninth aspect, the output unit outputs the still image with the region around the detected position in color and the other regions in monochrome.
With this configuration, appropriate highlighting can be achieved.
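The color/monochrome highlighting of the tenth aspect can be sketched per pixel. This sketch rests on assumptions. It represents an image as rows of `(r, g, b)` tuples and treats the “region around the position” as a circle of a chosen `radius`. It converts the rest to gray with the ITU-R BT.601 luma weights. The source fixes none of these choices, and `highlight` is a hypothetical name.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]


def highlight(image: List[List[RGB]], center: Tuple[int, int],
              radius: float) -> List[List[RGB]]:
    """Keep color inside a circle around center (x, y); gray out the rest.

    center is assumed to come from the position detection step; the circular
    region and the BT.601 luma weights are illustrative choices.
    """
    cx, cy = center
    out: List[List[RGB]] = []
    for y, row in enumerate(image):
        new_row: List[RGB] = []
        for x, (r, g, b) in enumerate(row):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                new_row.append((r, g, b))  # near the object: keep color
            else:
                v = round(0.299 * r + 0.587 * g + 0.114 * b)  # luma
                new_row.append((v, v, v))  # elsewhere: monochrome
        out.append(new_row)
    return out
```

For the twelfth aspect, the circular test could be swapped for a different region shape per scene type.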

In the information processing apparatus according to the eleventh aspect of the invention, which builds on any of the first to tenth aspects, the video information and audio information are structured as follows:
the video information includes the video and an object identifier identifying an object that appears in it; and
the audio information includes the audio and an object identifier identifying the speaker.
The utterance location detection unit detects, as an utterance location, a dialogue portion that meets both of these conditions:
the audio volume is at or above a predetermined level; and
audio paired with one object identifier is followed almost continuously by audio paired with another object identifier.
With this configuration, still images corresponding to portions in which a dialogue takes place can be extracted.
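Here is a minimal sketch of the dialogue test in the eleventh aspect. It assumes the volume thresholding has already produced a list of speaker-labelled utterances. It then chains utterances whose gaps stay within a tolerance and keeps chains with at least two distinct speakers. The `Utterance` type, the `detect_dialogues` name, and the 2-second `max_gap` default are hypothetical; the source only says the other speaker's audio is “almost continuous”.

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    speaker: str   # object identifier paired with the audio
    start: float   # seconds
    end: float     # seconds


def detect_dialogues(utts: List[Utterance],
                     max_gap: float = 2.0) -> List[List[Utterance]]:
    """Group utterances into dialogues.

    A dialogue is a chain in which each utterance starts within max_gap
    seconds of the previous one ending and at least two speakers take part.
    """
    dialogues: List[List[Utterance]] = []
    current: List[Utterance] = []
    for u in sorted(utts, key=lambda x: x.start):
        if current and u.start - current[-1].end <= max_gap:
            current.append(u)  # near-continuous: extend the chain
        else:
            if len({x.speaker for x in current}) >= 2:
                dialogues.append(current)
            current = [u]      # start a new chain
    if len({x.speaker for x in current}) >= 2:
        dialogues.append(current)
    return dialogues
```

A chain spoken by only one identifier is treated as a monologue and dropped here.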

The information processing apparatus according to the twelfth aspect of the invention builds on the eleventh aspect. It further comprises a scene type determination unit that analyzes the content constituting the utterance location (the dialogue location) and determines the type of scene. The output unit varies the shape of the region around the position detected by the position detection unit according to the scene type.
With this configuration, the type of scene can be grasped intuitively from the shape of the highlight.

The information processing apparatus of the present invention can extract appropriate still images from a moving image and present them as an overview.

Embodiments of the information processing apparatus and the like are described below with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated descriptions may be omitted.
FIG. 1 is a block diagram of the information processing apparatus according to this embodiment.

The information processing apparatus includes:
a content storage unit 101;
an utterance location detection unit 102;
a facing location detection unit 103;
a still image extraction unit 104;
a position detection unit 105;
a scene type determination unit 106;
a non-display user utterance detection unit 107; and
an output unit 108.
The output unit 108 includes a still image output means 1081 and a balloon output means 1082.

The content storage unit 101 stores one or more contents, each including video information having video and audio information having audio. A content includes, for example, video information, audio information, and object identifiers identifying the objects captured in the video. An object is, for example, a person, an exhibit, or an exhibition panel. The video information includes, for example, the video and an object identifier identifying the person who shot it. Alternatively, it includes the video and an object identifier identifying the exhibit on which the camera shooting the video is installed. The audio information includes the audio and an object identifier identifying the speaker.
Such content is acquired by, for example, the information acquisition apparatus shown in FIG. 2. In FIG. 2, a “CCD camera”, an “infrared ID tag”, and an “infrared sensor” are mounted above the ear of the user (an object). The CCD camera captures video. The infrared ID tag emits an infrared signal carrying the user's object identifier. The infrared sensor receives infrared signals from outside; that is, it is an “IR tracker”. The IR tracker receives the signal emitted by the infrared ID tag of an object the user is facing and obtains that object's identifier. The apparatus also has a “microphone” at the mouth, a “throat microphone” at the throat, and an HMD (head-mounted display) in front of the eyes. The CCD camera signal is captured by a PC carried on the user's back and transmitted from the PC to the information processing apparatus. The HMD is used to show information about where the user is, the exhibits the user has viewed, and the people (objects) the user has faced. The information acquisition apparatus in FIG. 2 is, of course, only one example of a way to acquire content.
The content storage unit 101 is preferably a non-volatile recording medium, but a volatile recording medium can also be used.

The utterance location detection unit 102 detects, based on the audio, an utterance location, that is, a portion in which someone is speaking. An utterance location is usually a portion that meets two conditions:
the audio volume is at or above a predetermined level; and
audio at or above that level continues for at least a predetermined time (for example, 5 seconds).
The utterance location detection unit 102 may also detect a dialogue portion as an utterance location (in this case also called a dialogue location). In such a portion, the audio volume is at or above the predetermined level, and audio paired with one object identifier is followed almost continuously by audio paired with another object identifier.
The utterance location detection unit 102 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
Here, a “location” means the portion of time-based information (such as a moving image or audio) that covers a certain period, or information specifying that portion.
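The basic utterance test (volume above a threshold, held for at least a minimum duration such as 5 seconds) is a run-length scan over volume samples. This sketch assumes the audio has already been reduced to one volume value per frame, for example RMS. It also assumes a fixed frame period `frame_sec`. The function and type names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds, exclusive


def detect_utterances(levels: Sequence[float], frame_sec: float,
                      threshold: float,
                      min_duration: float = 5.0) -> List[Segment]:
    """Return runs where volume stays >= threshold for >= min_duration s.

    levels: assumed per-frame volume values, one every frame_sec seconds.
    """
    segments: List[Segment] = []
    run_start = None  # index of the first loud frame in the current run
    # A trailing sentinel quiet frame closes any run still open at the end.
    for i, level in enumerate(list(levels) + [float("-inf")]):
        if level >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_sec >= min_duration:
                segments.append(Segment(run_start * frame_sec, i * frame_sec))
            run_start = None
    return segments


levels = [0.1] * 3 + [0.8] * 12 + [0.1] * 2 + [0.9] * 4
print(detect_utterances(levels, 0.5, 0.5))  # [Segment(start=1.5, end=7.5)]
```

In the example, the second loud run lasts only 2 seconds, so it is dropped.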

The facing location detection unit 103 detects a location in which two or more objects face each other. “Facing” does not require the objects to be directly face to face; it is enough that they are positioned so that they could converse. A person standing in front of an exhibit is also said to be facing it.
For example, suppose the object identifier obtained by user A's IR tracker is user B's, and the identifier obtained by user B's IR tracker is user A's. That is, each user's information acquisition apparatus reports the other user's identifier. The facing location detection unit 103 then determines that users A and B are facing each other.
Alternatively, facing objects may be detected by image analysis of the video; any algorithm for recognizing “facing” may be used.
The facing location detection unit 103 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
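The mutual IR-tracker rule above reduces to a symmetric check on one instant's detections. This sketch assumes the detections are collected into a dictionary that maps each wearer's object identifier to the set of identifiers its IR tracker received. It covers only the two-user rule from the text; one-sided detection of an exhibit's tag is not handled. `mutual_facing` is a hypothetical name.

```python
from typing import Dict, FrozenSet, Set


def mutual_facing(detections: Dict[str, Set[str]]) -> Set[FrozenSet[str]]:
    """Return unordered pairs of wearers whose IR trackers detect each other.

    detections: wearer id -> ids received by that wearer's IR tracker
    at one instant (assumed input format).
    """
    pairs: Set[FrozenSet[str]] = set()
    for a, seen in detections.items():
        for b in seen:
            # A sees B and B sees A -> A and B are facing each other.
            if a != b and a in detections.get(b, set()):
                pairs.add(frozenset((a, b)))
    return pairs


print(mutual_facing({"A": {"B"}, "B": {"A"}, "C": {"A"}}))
```

Running the check at every sampled instant and merging consecutive hits would give the facing locations as time spans.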

The still image extraction unit 104 extracts from the video one or more still images constituting the utterance location detected by the utterance location detection unit 102. It also extracts one or more still images from the video constituting the facing location detected by the facing location detection unit 103. It may repeatedly extract from the video one or more still images constituting an utterance location and/or a facing location. “Extracting a still image from the video” may mean obtaining the still image itself. It may also mean obtaining pointer information into the video that is used to retrieve the still image.
The still image extraction unit 104 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
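The “pointer information” variant can be sketched as turning a detected segment into frame indices, without decoding any images. This sketch assumes a constant frame rate `fps` and stills spaced evenly across the segment. The source does not say which frames are chosen, and `sample_frame_indices` is a hypothetical name.

```python
from typing import List


def sample_frame_indices(start_sec: float, end_sec: float, fps: float,
                         count: int) -> List[int]:
    """Return count frame indices spread evenly over [start_sec, end_sec).

    Assumes a constant frame rate; integer arithmetic keeps the result
    deterministic.
    """
    if count <= 0:
        return []
    first = int(start_sec * fps)
    last = max(first, int(end_sec * fps) - 1)
    if count == 1:
        return [(first + last) // 2]  # a single still: take the middle frame
    return [first + i * (last - first) // (count - 1) for i in range(count)]
```

A decoder could then fetch the actual frames. For example, OpenCV's `cv2.VideoCapture` can seek with `set(cv2.CAP_PROP_POS_FRAMES, index)` before calling `read()`.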

The position detection unit 105 analyzes the video and detects a position at least in the vicinity of an object. Objects such as people and exhibits carry infrared ID tags that emit infrared light. By analyzing two or more still images, the position detection unit 105 may detect where in the video infrared light is being emitted, that is, where an infrared ID tag is. It may then treat that emission position as the position of the object (a person, an exhibit, etc.).
The position detection unit 105 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
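One way to find the tag by “analyzing two or more still images” is frame differencing. This sketch assumes the tag's IR LED is lit in one grayscale frame and dark in the other, and that the camera can see infrared. Neither detail is stated in the source. Frames are plain nested lists of 0-255 values, and `min_diff` is an illustrative threshold.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # grayscale pixel rows, values 0-255


def locate_ir_tag(frame_on: Frame, frame_off: Frame,
                  min_diff: int = 50) -> Optional[Tuple[int, int]]:
    """Return (row, col) where brightness rose the most between the frames.

    Assumes the LED is on in frame_on and off in frame_off; returns None
    if no pixel brightened by at least min_diff.
    """
    best, best_pos = min_diff - 1, None
    for r, (row_on, row_off) in enumerate(zip(frame_on, frame_off)):
        for c, (p_on, p_off) in enumerate(zip(row_on, row_off)):
            d = p_on - p_off
            if d > best:  # strictly greater, so d >= min_diff
                best, best_pos = d, (r, c)
    return best_pos
```

With more than two frames, the same difference could be accumulated over a known blink pattern to reject moving bright objects. That refinement is also an assumption.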

The scene type determination unit 106 analyzes the content constituting an utterance location that is a dialogue location and determines the type of scene. Examples of scene types:
“facing”: a person (a kind of object) shown in the video is looking at an exhibit (a kind of object).
“being spoken to”: the person shown in the video is talking to another person (a kind of object).
“speaking”: another person is talking to the person shown in the video.
“dialogue”: the person shown in the video and another person are talking to each other.
The scene type determination unit 106 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
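The four scene types can be read as a decision table over who is speaking. This sketch assumes two booleans are already available: whether the visible person is speaking and whether another person is speaking, such as the camera wearer. It also takes a flag for whether the visible person faces an exhibit. The `HIGHLIGHT_SHAPE` mapping for the twelfth aspect is invented purely for illustration.

```python
from enum import Enum
from typing import Optional


class Scene(Enum):
    FACING = "facing"              # visible person looks at an exhibit
    SPOKEN_TO = "being spoken to"  # visible person talks to another person
    SPEAKING = "speaking"          # another person talks to the visible person
    DIALOGUE = "dialogue"          # both talk to each other


# Hypothetical highlight shapes per scene type; not specified by the source.
HIGHLIGHT_SHAPE = {
    Scene.FACING: "rectangle",
    Scene.SPOKEN_TO: "ellipse",
    Scene.SPEAKING: "dashed ellipse",
    Scene.DIALOGUE: "starburst",
}


def classify_scene(visible_speaks: bool, other_speaks: bool,
                   faces_exhibit: bool) -> Optional[Scene]:
    """Map assumed speech/facing flags to one of the four scene types."""
    if visible_speaks and other_speaks:
        return Scene.DIALOGUE
    if visible_speaks:
        return Scene.SPOKEN_TO
    if other_speaks:
        return Scene.SPEAKING
    return Scene.FACING if faces_exhibit else None
```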

The non-display user utterance detection unit 107 analyzes the content constituting an utterance location or a facing location. It detects when the audio corresponding to a still image to be output is the voice of a user who does not appear in that still image.
The non-display user utterance detection unit 107 can usually be implemented with an MPU, memory, and the like. Its processing procedure is usually implemented in software recorded on a recording medium such as a ROM, but it may also be implemented in hardware (a dedicated circuit).
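The off-screen speaker check, combined with balloon placement (seventh and eighth aspects), can be sketched as a lookup. It assumes the speaker's object identifier comes from the audio information, and that the identifiers and positions of objects visible in the still come from the position detection step. Anchoring off-screen speakers at the top-right corner is an arbitrary illustrative choice, and `balloon_anchor` is a hypothetical name.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]


def balloon_anchor(speaker_id: str, visible_positions: Dict[str, Point],
                   window_size: Tuple[int, int]) -> Tuple[Point, bool]:
    """Return (anchor point, speaker_off_screen) for a speech balloon.

    visible_positions: object id -> (x, y) of objects detected in the still.
    """
    if speaker_id in visible_positions:
        # Speaker is visible: place the balloon near that person.
        return visible_positions[speaker_id], False
    width, _ = window_size
    # Speaker not shown: let the balloon enter from the window's corner.
    return (width - 1, 0), True
```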

The output unit 108 outputs the one or more still images extracted by the still image extraction unit 104 to one or more substantially non-overlapping windows. The windows are usually tiled, but they may partially overlap. A “window” here is not limited to a window in a window system; any display divided into sections will do. The output unit 108 preferably displays the still images like a comic. Some comic techniques let panel frames overlap, and displaying still images in that style is also preferable. The output unit 108 includes, for example, a still image output means 1081 and a balloon output means 1082. The balloon output means 1082 is, however, not essential to the output unit 108.
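A plain tiled layout is the simplest way to get substantially non-overlapping windows. This sketch assumes a near-square grid with a fixed gutter between panels. It does not attempt comic-style variable or overlapping frames. `tile_windows` and its parameters are hypothetical.

```python
import math
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # x, y, width, height


def tile_windows(n: int, page_w: int, page_h: int,
                 gutter: int = 8) -> List[Rect]:
    """Lay out n equal panels in a near-square grid separated by gutters."""
    if n <= 0:
        return []
    cols = math.ceil(math.sqrt(n))
    rows = math.ceil(n / cols)
    cell_w = (page_w - gutter * (cols + 1)) // cols
    cell_h = (page_h - gutter * (rows + 1)) // rows
    rects: List[Rect] = []
    for i in range(n):
        r, c = divmod(i, cols)  # fill the grid row by row
        rects.append((gutter + c * (cell_w + gutter),
                      gutter + r * (cell_h + gutter), cell_w, cell_h))
    return rects
```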

The still image output means 1081 outputs the still images extracted by the still image extraction unit 104, in any manner. It may also output audio synchronized with the video containing the still image. It may or may not be considered to include an output device such as a display. It can be implemented by output-device driver software, or by such driver software together with the output device. “Output” here includes display on a screen, printing on a printer, audio output, transmission to an external device, and the like.

The balloon output means 1082 outputs a speech balloon superimposed on the still image output by the still image output means 1081. The following behaviors are preferable:
it analyzes the audio corresponding to the video containing the still image and, according to the analysis result, outputs balloons of two or more distinguishable shapes;
it analyzes that audio, obtains the length of the utterance, and outputs inside the balloon a character string whose length corresponds to that utterance length;
it outputs the balloon near the position detected by the position detection unit 105; and
when the non-display user utterance detection unit 107 detects that the voice belongs to a user who does not appear in the still image, it outputs the balloon so that the balloon appears to come from outside or from a corner of the window.
A “speech balloon” is a frame for adding remarks, comments, explanations, and the like, as used in comics. Its shape, color, and other output attributes are not limited, and it may contain anything or nothing at all.
The balloon output means 1082 may or may not be considered to include a display device. The output unit 108 can be implemented by, for example, display-device driver software, or by such driver software together with a display device. “Output” here includes display on a screen, printing on a printer, transmission to an external device, and the like.
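The fifth and sixth aspects can be combined in a small balloon factory. Under this sketch's assumptions, the shape depends on whether the utterance's peak volume crosses a “loud” threshold. The filler text grows with the utterance duration at a fixed characters-per-second rate. The thresholds, the filler character, the shape names, and `make_balloon` are all hypothetical; the source only requires two or more shapes and a string length tied to utterance length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Balloon:
    shape: str  # "oval" for ordinary speech, "jagged" for loud speech
    text: str   # filler whose length tracks how long the utterance lasted


def make_balloon(duration_sec: float, peak_level: float,
                 chars_per_sec: float = 2.0, loud_level: float = 0.8,
                 filler: str = "~") -> Balloon:
    """Build a balloon whose text length follows the utterance duration."""
    n_chars = max(1, int(duration_sec * chars_per_sec))  # at least one char
    shape = "jagged" if peak_level >= loud_level else "oval"
    return Balloon(shape=shape, text=filler * n_chars)
```

A 3-second utterance at normal volume would get an oval balloon with six filler characters.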

Next, the operation of the information processing apparatus will be described using the flowcharts of FIGS. 3 to 9. This example of the operation assumes the following situation. A user (a kind of object, hereinafter referred to as "the main character" where appropriate) tours a venue in which exhibits, exhibition panels, and the like are displayed, walking around, looking at the exhibits, and discussing them with the exhibit presenters and others. The main character wears an information acquisition device such as the one shown in FIG. 2. Other visitors at the venue, the exhibit presenters, and the exhibits themselves are likewise equipped with information acquisition devices such as the one shown in FIG. 2. Content containing the video, audio, object identifiers, and so on acquired by each object (visitors, presenters, exhibits, etc.) is stored in the information processing apparatus. The operation of the information processing apparatus in this situation is described below.

(Step S301) The utterance location detection unit 102 acquires, from the content storage unit 101, the content accumulated by the information acquisition device worn by the main character. Normally, the information processing apparatus accepts input of an object identifier that identifies the main character and acquires the content corresponding to that object identifier.
(Step S302) The utterance location detection unit 102 detects utterance locations based on the content and other information acquired in step S301. Details of the utterance location detection process will be described with reference to the flowcharts of FIGS. 4 and 5.

(Step S303) The facing location detection unit 103 detects locations in the content acquired in step S301 where two or more objects face each other. Details of the facing location detection process will be described with reference to the flowchart of FIG. 6.

(Step S304) The output unit 108 determines the frame layout used for output. Determining the frame layout means determining the arrangement, sizes, and so on of the windows on the screen. The frame layout determination process is, for example, a process of selecting frame layout information, that is, information indicating a predetermined frame layout. Alternatively, the process may freely determine the number and sizes of the windows. Details of the frame layout determination process will be described with reference to the flowchart of FIG. 7.
(Step S305) The output unit 108 sets the counter i to 1.

(Step S306) The output unit 108 determines whether the i-th location (an utterance location or a facing location) exists. If it exists, the process proceeds to step S307; otherwise, it proceeds to step S309.

(Step S307) The output unit 108 determines the frame in which to output one still image constituting the video of the i-th location. Preferably, the output unit 108 places the still image in a larger frame when the i-th location is long and in a smaller frame when it is short. An example of a detailed algorithm for this is as follows. First, the output unit 108 acquires the total number P of locations (for example, 10). Next, it acquires the number N of frames in the frame layout determined in step S304 (for example, 5) and sorts the N frames by size. It also acquires the durations of all locations and determines that the i-th location is the M-th longest (M is, for example, 2). Using P, N, and M as parameters, still images of longer locations are assigned to larger frames. With P = 10, N = 5, and M = 2, each frame alternately displays the still images of two locations, so the output unit 108 assigns the still image of the i-th location to the largest frame.
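
As a rough illustration of step S307, the following Python sketch assumes the frames are already sorted from largest (index 0) to smallest, and spreads duration ranks evenly across them with rank × N // P. The function name and this proportional mapping are assumptions, not taken from the source; with the source's example values (P = 10, N = 5), each frame receives two locations and the second-longest location (M = 2) lands in the largest frame.

```python
from typing import List

def assign_frames(durations: List[float], num_frames: int) -> List[int]:
    """Map each location to a frame index (0 = largest frame).

    Hypothetical helper: locations are ranked by duration, longest first,
    and the ranks are spread evenly over the size-sorted frames, so each
    frame receives about P / N locations.
    """
    p = len(durations)
    # Indices of locations sorted from longest to shortest duration.
    order = sorted(range(p), key=lambda k: durations[k], reverse=True)
    frames = [0] * p
    for rank, loc in enumerate(order):  # rank 0 is the longest location
        frames[loc] = rank * num_frames // p
    return frames
```

For example, with ten durations and five frames, the two longest locations share frame 0 and the two shortest share frame 4.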
(Step S308) The output unit 108 increments the counter i by 1, and the process returns to step S306.
(Step S309) The output unit 108 sets the counter i to 1.
(Step S310) The output unit 108 determines whether the i-th location exists. If it exists, the process proceeds to step S311; otherwise, the process ends.

(Step S311) The output unit 108 acquires one still image from the video in the content of the i-th location. Any algorithm may be used to select this still image. For example, the output unit 108 acquires the still image at the midpoint of the i-th location, or it may acquire the first still image of the i-th location's content.
(Step S312) The output unit 108 performs the balloon process. Details of the balloon process will be described with reference to the flowchart of FIG. 8.

(Step S313) The output unit 108 generates the still image to be output based on the still image acquired in step S311 and the balloon image acquired in step S312. Details of this output image generation process will be described with reference to the flowchart of FIG. 9.
(Step S314) The output unit 108 displays the still image generated in step S313 in its assigned frame (window). The assignment of each still image to a frame was determined in step S307.
(Step S315) The output unit 108 increments the counter i by 1, and the process returns to step S310.
In the flowchart of FIG. 3, the audio corresponding to the i-th location may also be output. When two or more still images are being output, for example, one audio stream may be selected and output.
In the flowchart of FIG. 3, the process need not end; it may instead be repeated from step S301.

Next, the utterance location detection process will be described in detail using the flowcharts of FIGS. 4 and 5. Utterance locations include one-way utterance locations, where a single object speaks unilaterally, and dialogue locations, where two objects talk with each other.

(Step S401) The utterance location detection unit 102 acquires and marks every location where the main character's voice is substantially continuous for a predetermined time or longer. A voice location may be the audio data itself, or information on its start and end points within the content; its data structure is not otherwise limited. Normally, the content storage unit 101 manages each content item paired with an object identifier. "Marking" means, for example, writing "1" into the scene type identifier of the location information for the voice location (location information containing, for example, the main character's object identifier, start and end point information, and a scene type identifier). The scene type identifier indicates the type of scene (interaction) at the extracted location: for example, "1" means "utterance" (one-way speech), "2" means "dialogue" (two-way conversation), and "3" means "facing" (the main character is looking at an exhibit).
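
Step S401 can be sketched as follows, assuming the raw audio has already been reduced to (start, end) speech intervals by some voice activity detector, and reading "substantially continuous" as "pauses of at most max_gap seconds". The class names, field names, and both threshold values are hypothetical; only the scene type codes 1 to 3 come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple

class SceneType(IntEnum):
    UTTERANCE = 1  # one-way speech
    DIALOGUE = 2   # two-way conversation
    FACING = 3     # main character facing an exhibit

@dataclass
class LocationInfo:
    object_id: str
    start: float   # seconds from the start of the content
    end: float
    scene_type: SceneType

def mark_voice_locations(object_id: str,
                         speech: List[Tuple[float, float]],
                         max_gap: float = 0.5,
                         min_len: float = 2.0) -> List[LocationInfo]:
    """Merge speech intervals separated by short pauses, keep runs lasting
    at least min_len seconds, and mark them as utterances (scene type 1)."""
    merged: List[List[float]] = []
    for start, end in sorted(speech):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)  # short pause: extend run
        else:
            merged.append([start, end])              # long pause: new run
    return [LocationInfo(object_id, s, e, SceneType.UTTERANCE)
            for s, e in merged if e - s >= min_len]
```

The same function serves step S402 by passing another person's object identifier and speech intervals.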

(Step S402) The utterance location detection unit 102 acquires and marks every location where the voice of someone other than the main character is substantially continuous for a predetermined time or longer. The location information for such a voice location includes, for example, the object identifier of that person, start and end point information, and a scene type identifier, into which "1" (indicating an utterance) is written.
(Step S403) The utterance location detection unit 102 sets the counters i and j to 1.

(Step S404) The utterance location detection unit 102 determines whether the i-th voice location of the main character exists among those acquired in step S401. If it exists, the process proceeds to step S405; otherwise, it returns to the calling function.
(Step S405) The utterance location detection unit 102 acquires the i-th voice location of the main character.

(Step S406) The utterance location detection unit 102 determines whether the j-th voice location of another person exists among those acquired in step S402. If it exists, the process proceeds to step S407; otherwise, it proceeds to step S414.
(Step S407) The utterance location detection unit 102 acquires the j-th voice location of the other person.

(Step S408) The utterance location detection unit 102 determines whether the i-th voice location of the main character and the j-th voice location of the other person are substantially continuous. The method for making this determination is explained with reference to FIG. 10. "Substantially continuous" refers to situations (a), (b), and (c) in FIG. 10. In (a), two voices (voice 1 and voice 2) follow one another with only a slight overlap x; x is preferably, for example, 3 seconds or less. In (b), the two voices are contiguous, with neither overlap nor gap. In (c), there is a short interval y between the two voices; y is preferably, for example, 2 seconds or less. The cases shown in FIGS. 10(d) and 10(e) are not judged to be substantially continuous, because the two speakers' voices overlap for a predetermined time or longer. In FIG. 10, a line indicates that a person is speaking, and the horizontal axis is time (t). Voice 1 and voice 2 are uttered by different persons (that is, they are paired with different speaker identifiers).
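
A minimal sketch of the step S408 test, assuming each voice is a (start, end) pair in seconds. The thresholds default to the "for example" values above (3 s overlap, 2 s gap); the function name is hypothetical, and any overlap longer than max_overlap is treated as a case like FIG. 10(d) or (e).

```python
from typing import Tuple

Segment = Tuple[float, float]

def nearly_continuous(a: Segment, b: Segment,
                      max_overlap: float = 3.0, max_gap: float = 2.0) -> bool:
    """True if two voice segments by different speakers follow one another
    as in FIG. 10(a) to (c)."""
    # Positive: length of the overlap; zero or negative: minus the gap length.
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return -overlap <= max_gap   # case (b) contiguous, case (c) short gap
    return overlap <= max_overlap    # case (a) slight overlap
```

Because the overlap is computed symmetrically, the order in which the two segments are passed does not matter.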

(Step S409) The utterance location detection unit 102 groups the i-th voice location of the main character with the j-th voice location of the other person. Grouping two voice locations means making it possible to identify them as a dialogue; for example, "2" is written into the scene type identifiers of both voice locations, and the two locations are associated with each other.

(Step S410) The utterance location detection unit 102 determines whether the j-th voice location of the other person is later in time than the i-th voice location of the main character. If it is later, the process proceeds to step S411; otherwise, it proceeds to step S412.
(Step S411) The utterance location detection unit 102 increments the counter i by 1, and the process returns to step S404.
(Step S412) The utterance location detection unit 102 increments the counter j by 1, and the process returns to step S404.
(Step S413) The utterance location detection unit 102 increments the counter j by 1, and the process returns to step S406.
(Step S414) The utterance location detection unit 102 increments the counter i by 1.
(Step S415) The utterance location detection unit 102 increments the counter j by 1, and the process returns to step S404.
Needless to say, the utterance location detection algorithm is not limited to the above.
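
The counter-driven loop of steps S403 to S415 behaves much like a two-pointer walk over two time-sorted lists. The sketch below is one possible reading, assuming both lists are sorted by start time and that "later in time" in step S410 compares start times; the function name and data shapes are hypothetical, and the continuity test repeats the step S408 sketch so the block stands alone.

```python
from typing import List, Tuple

Segment = Tuple[float, float]

def pair_dialogues(hero: List[Segment], others: List[Segment],
                   max_overlap: float = 3.0, max_gap: float = 2.0
                   ) -> List[Tuple[int, int]]:
    """Return 0-based index pairs (i, j) of main-character and other-person
    voice locations judged to form a dialogue."""
    def nearly_continuous(a: Segment, b: Segment) -> bool:
        overlap = min(a[1], b[1]) - max(a[0], b[0])
        return -overlap <= max_gap if overlap <= 0 else overlap <= max_overlap

    pairs = []
    i = j = 0
    while i < len(hero) and j < len(others):   # S404 / S406
        if nearly_continuous(hero[i], others[j]):
            pairs.append((i, j))               # S409: both become scene type 2
        if others[j][0] > hero[i][0]:          # S410: other's voice is later
            i += 1                             # S411
        else:
            j += 1                             # S412
    return pairs
```

One main-character voice can pair with both the preceding and the following reply, as the example below shows.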
Next, the operation of acquiring another person's voice locations will be described using the flowchart of FIG. 5.

(Step S501) The utterance location detection unit 102 acquires all object identifiers contained in the content paired with the main character's object identifier. These are the object identifiers acquired by the information acquisition device carried by the main character, and they identify the people, exhibits, and so on that faced the main character.
(Step S502) The utterance location detection unit 102 sets the counter i to 1.

(Step S503) The utterance location detection unit 102 determines whether an i-th object identifier exists among those acquired in step S501. If it exists, the process proceeds to step S504; otherwise, it returns to the calling function.
(Step S504) From the voice corresponding to the i-th object identifier (that is, the voice uttered by the user identified by the i-th object identifier), the utterance location detection unit 102 acquires every voice location lasting a predetermined time or longer.
(Step S505) The utterance location detection unit 102 sets the counter j to 1.

(Step S506) The utterance location detection unit 102 determines whether a j-th voice location exists among those acquired in step S504. If it exists, the process proceeds to step S507; otherwise, it proceeds to step S511.

(Step S507) The utterance location detection unit 102 acquires the object identifiers corresponding to the j-th voice location. These are the object identifiers acquired by the information acquisition device worn by that other person during the time span of the voice location; that is, each object they identify was facing the other person during that time span.

(Step S508) The utterance location detection unit 102 determines whether the main character's object identifier is among the object identifiers acquired in step S507. If it is, the process proceeds to step S509; otherwise, it jumps to step S510.
(Step S509) The utterance location detection unit 102 acquires and marks the j-th voice location.
(Step S510) The utterance location detection unit 102 increments the counter j by 1, and the process returns to step S506.
(Step S511) The utterance location detection unit 102 increments the counter i by 1, and the process returns to step S503.
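
The filter in steps S501 to S511 keeps another person's voice location only if that person's own device detected the main character during it. A minimal sketch, assuming hypothetical data shapes: long voice segments per object identifier (the output of step S504), and timestamped identifier detections per device.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]

def others_voice_toward_hero(hero_id: str,
                             seen_by_hero: List[str],
                             voice: Dict[str, List[Segment]],
                             sightings: Dict[str, List[Tuple[float, str]]]
                             ) -> List[Tuple[str, Segment]]:
    """Return (object_id, segment) for every marked voice location.

    seen_by_hero: object IDs detected by the main character's device (S501).
    voice[obj]: that object's voice segments of sufficient length (S504).
    sightings[obj]: (time, detected_id) pairs recorded by obj's device.
    """
    marked = []
    for other in seen_by_hero:                       # S503 loop over i
        for start, end in voice.get(other, []):      # S506 loop over j
            # S507: IDs the other person's device saw during this segment
            ids = {oid for t, oid in sightings.get(other, []) if start <= t <= end}
            if hero_id in ids:                       # S508
                marked.append((other, (start, end)))  # S509
    return marked
```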
Next, an example of the facing location detection process will be described in detail using the flowchart of FIG. 6.

(Step S601) The facing location detection unit 103 examines the object identifiers in the content corresponding to the main character's object identifier, and acquires every pair consisting of an object identifier acquired substantially continuously for a predetermined time or longer and time information describing that period.
(Step S602) The facing location detection unit 103 sets the counter i to 1.

(Step S603) The facing location detection unit 103 determines whether an i-th pair exists among the object identifier and time information pairs acquired in step S601. If it exists, the process proceeds to step S604; otherwise, it returns to the calling function.
(Step S604) For the content paired with the object identifier in the i-th pair, the facing location detection unit 103 acquires all object identifiers within the time indicated by that pair's time information.

(Step S605) Based on the object identifiers acquired in step S604, the facing location detection unit 103 determines whether the main character's object identifier satisfies a predetermined condition. This condition serves to recognize that the object identified by the i-th pair's object identifier and the main character faced each other for a predetermined time or longer. Here, the condition is, for example, that the main character's object identifier appears among the object identifiers within the period indicated by the i-th pair's time information without any gap of a predetermined interval or longer. If the condition is satisfied, the process proceeds to step S606; otherwise, it jumps to step S608.

(Step S606) The facing location detection unit 103 determines whether neither the main character nor the object facing the main character is speaking during the time indicated by the i-th pair's time information. If neither is speaking, the process proceeds to step S607; otherwise, it jumps to step S608.
(Step S607) The facing location detection unit 103 acquires and marks the facing location based on the time information of the i-th pair.
(Step S608) The facing location detection unit 103 increments the counter i by 1, and the process returns to step S603.
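
Steps S603 to S608 combine a mutual-detection test with a silence test. The sketch below assumes hypothetical inputs: runs from step S601, the times at which each object's own device detected the main character, and voice segments per object. "Without any gap" is read as "no hole longer than max_gap seconds, counted from the start to the end of the run"; max_gap's value is an assumption.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float]

def detect_facing(hero_id: str,
                  hero_runs: List[Tuple[str, Segment]],
                  sees_hero: Dict[str, List[float]],
                  speech: Dict[str, List[Segment]],
                  max_gap: float = 1.0) -> List[Tuple[str, Segment]]:
    """Return the (object_id, (start, end)) runs judged to be facing locations.

    hero_runs: runs in which the hero's device kept detecting object_id (S601).
    sees_hero[obj]: times at which obj's own device detected the hero's ID.
    speech[id]: voice segments of each object, used for the silence test.
    """
    facing = []
    for obj, (start, end) in hero_runs:                          # S603/S604
        times = sorted(t for t in sees_hero.get(obj, []) if start <= t <= end)
        # S605: the hero's ID must recur across the whole run, no hole > max_gap
        points = [start] + times + [end]
        mutual = bool(times) and all(b - a <= max_gap
                                     for a, b in zip(points, points[1:]))
        # S606: neither party may be speaking at any time during the run
        voices = speech.get(hero_id, []) + speech.get(obj, [])
        silent = all(e <= start or s >= end for s, e in voices)
        if mutual and silent:
            facing.append((obj, (start, end)))                   # S607
    return facing
```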
Next, an example of the frame layout determination operation will be described using the flowchart of FIG. 7. Here, the output unit 108 is assumed to store two or more pieces of frame layout information, each indicating a frame layout.
(Step S701) The output unit 108 acquires the number of locations detected earlier (utterance locations, dialogue locations, and facing locations).
(Step S702) The output unit 108 acquires information about the data size (equivalent to the duration) of every previously detected location.

(Step S703) The output unit 108 determines the frame layout information based on the total number of locations acquired in step S701 and the data size of each location acquired in step S702. For example, it selects frame layout information with many frames when there are many locations, and with few frames when there are few. Likewise, it selects frame layout information with little variation in frame size when the data sizes vary little, and with large variation in frame size when the data sizes vary widely. The algorithm for determining the frame layout information is, however, not limited.
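
One way to realize step S703 is to score each stored layout on both criteria. In this sketch, a layout is represented as a hypothetical list of relative frame areas, "many locations, many frames" is read as "frame count closest to the number of locations", and variation is measured with the coefficient of variation. All three choices are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def cv(values: Sequence[float]) -> float:
    """Coefficient of variation: spread of the values relative to their mean."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0

def choose_layout(durations: List[float], layouts: List[List[float]]) -> int:
    """Return the index of the stored layout whose frame count best matches the
    number of locations; ties are broken by how closely the spread of frame
    areas matches the spread of location durations."""
    n = len(durations)
    target = cv(durations)

    def score(k: int):
        areas = layouts[k]
        return (abs(len(areas) - n), abs(cv(areas) - target))

    return min(range(len(layouts)), key=score)
```

Lexicographic tuple scoring makes the frame count the primary criterion. A weighted sum would be an equally plausible alternative.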
Next, a detailed example of the balloon process will be described using the flowchart of FIG. 8.
(Step S801) The balloon output means 1082 sets the counter i to 1.

(Step S802) The balloon output means 1082 determines whether an i-th voice exists in the location (dialogue location or the like) corresponding to the still image acquired in step S311. If it exists, the process proceeds to step S803; otherwise, it jumps to step S812. The i-th voice ranges over all voices, both the main character's and everyone else's, and each such voice is a group of sound uttered for a predetermined time or longer.

(Step S803) The balloon output means 1082 determines whether the i-th voice is the voice of an object (a person appearing in the video) whose object identifier is contained in the content paired with the main character's object identifier. If it is, the process proceeds to step S804; otherwise, it jumps to step S811.

(Step S804) The position detection unit 105 acquires the position of the object in the still image. For example, the position detection unit 105 can acquire the object's position by analyzing the video containing the still image and recognizing the position that is emitting infrared light.
(Step S805) The balloon output means 1082 acquires the data length of the i-th voice.

(Step S806) The balloon output means 1082 determines the balloon type based on the data length acquired in step S805. For example, it selects a jagged balloon when the voice data length is short and a roughly elliptical balloon when it is long.

(Step S807) The balloon output means 1082 generates a character string based on the data length acquired in step S805. The string may, for example, be meaningless: the balloon output means 1082 generates a string whose length corresponds to the data length, for instance by generating a random number, converting it into a character code to obtain a character, and repeating this according to the data length.
(Step S808) The balloon output means 1082 generates a balloon image from the balloon information determined in step S806 and the character string generated in step S807.
(Step S809) The balloon output means 1082 sets the position of the object as the display position of the balloon image.
(Step S810) The balloon output means 1082 increments the counter i by 1, and the process returns to step S802.
(Step S811) The balloon output means 1082 sets the position of the object to the outside of the frame (window). "Outside" here includes the corner points; for example, the relative coordinates (0, 0) within the window may be used.
(Step S812) The balloon output means 1082 determines whether the counter i is 1. If it is, the process returns to the calling function; otherwise, it proceeds to step S813.

(Step S813) The balloon output means 1082 places the one or more balloon images generated in step S808 at the display positions set in step S809, producing a composite balloon image to be combined with the still image. The process then returns to the calling function.
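
The per-voice decisions of the balloon process (steps S802 to S811) can be sketched as follows, leaving actual image drawing aside. The 1.5-second short/long threshold, the four characters per second, and the a to z character range are all assumptions; the text only says that short voices get jagged balloons, that the string is random and sized to the voice, and that off-screen speakers are anchored outside the window, for example at (0, 0).

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Balloon:
    shape: str               # "jagged" for short voices, "ellipse" for long ones
    text: str                # meaningless placeholder text
    anchor: Tuple[int, int]  # display position in window coordinates

def make_balloons(voices: List[Tuple[float, Optional[Tuple[int, int]]]],
                  short_threshold: float = 1.5,
                  chars_per_second: float = 4.0,
                  rng: Optional[random.Random] = None) -> List[Balloon]:
    """voices: (duration in seconds, speaker position in the still image, or
    None when the speaker does not appear in it). One balloon per voice."""
    rng = rng or random.Random()
    balloons = []
    for duration, position in voices:                               # S802 loop
        shape = "jagged" if duration < short_threshold else "ellipse"  # S806
        n_chars = max(1, round(duration * chars_per_second))
        # S807: random numbers converted to character codes (here a to z)
        text = "".join(chr(rng.randint(ord("a"), ord("z")))
                       for _ in range(n_chars))
        anchor = position if position is not None else (0, 0)     # S809 / S811
        balloons.append(Balloon(shape, text, anchor))
    return balloons  # an empty list means no balloon (S812 with i == 1)
```

Passing a seeded random.Random makes the placeholder text reproducible.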
Next, a detailed example of the output image generation process will be described using the flowchart of FIG. 9.

(Step S901) The position detection unit 105 acquires the position of the object in the still image to be output. Either one object or two or more objects may be acquired here. The position of an object in the still image is obtained, for example, by image recognition of the position from which an infrared signal is being emitted.
(Step S902) The output unit 108 determines whether the scene type of the location containing the still image is a one-way utterance. If it is, the process proceeds to step S903; otherwise, it jumps to step S906.
(Step S903) The output unit 108 sets the shape of the highlight used to emphasize the object in the still image to the first shape (for example, a circle), which corresponds to "utterance".

(Step S904) The output unit 108 applies highlighting to the still image based on the highlight shape determined in step S903 or in a corresponding step. Centered on the position acquired in step S901, the output unit 108 keeps the region of the determined shape in color and converts the rest of the still image to monochrome. That is, the output unit 108 normally converts the color image to monochrome in every region except the region of the determined shape centered on the position acquired in step S901. Converting a color image to a monochrome image is a known technique, so a detailed description is omitted. The highlighting method is not limited to the above; it suffices for the output unit 108 to output the still image with a different output mode in the region around the position detected by the position detection unit 105 than in the other regions.
(Step S905) The output unit 108 combines the still image processed in step S904 with the generated balloon image to compose the still image to be output. The process then returns to the calling function.
(Step S906) The output unit 108 determines whether the scene type of the location containing the still image is “dialogue”. If so, the process goes to step S907; otherwise, it jumps to step S908.
(Step S907) The output unit 108 sets the highlight shape used to emphasize the object in the still image to a second shape (for example, an ellipse) corresponding to “dialogue”. The process goes to step S904.
(Step S908) The output unit 108 determines whether the scene type of the location containing the still image is “facing”. If so, the process goes to step S909; otherwise, it goes to step S910.
(Step S909) The output unit 108 sets the highlight shape used to emphasize the object in the still image to a third shape (for example, a rectangle) corresponding to “facing”. The process goes to step S904.
(Step S910) The output unit 108 sets the highlight shape used to emphasize the object in the still image to a fourth shape (for example, a star). The process goes to step S904.
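The branching in steps S902 to S910 amounts to a lookup from scene type to highlight shape. The following is a minimal Python sketch, assuming the scene types are the plain strings "utterance", "dialogue", and "facing", and using the example shapes named in each step; the function name `select_highlight_shape` is hypothetical.

```python
def select_highlight_shape(scene_type: str) -> str:
    """Mirror steps S902-S910: pick an example highlight shape for a scene type."""
    if scene_type == "utterance":  # S902 -> S903: first shape
        return "circle"
    if scene_type == "dialogue":   # S906 -> S907: second shape
        return "ellipse"
    if scene_type == "facing":     # S908 -> S909: third shape
        return "rectangle"
    return "star"                  # S910: fourth shape for any other scene type
```

Every branch then continues to the highlight processing of step S904, so the shape is the only thing that differs between scene types.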

A specific operation of the information processing apparatus according to the present embodiment is described below. In this specific example, it is assumed that users, including the protagonist, are visiting an exhibition or a museum; here, the exhibition is a technology exhibition. The users, including the protagonist, wear an information acquisition device such as the one shown in FIG. 2. Each exhibit (such as a technology-exhibition panel) is also equipped with either the information acquisition device shown in FIG. 2 or a version of it without the microphone, throat microphone, and HMD. This example describes the processing performed when the protagonist spends several hours viewing exhibition panels, discussing them with other people, or explaining them to others.

FIG. 11 shows, for this specific example, the raw content (video, audio, and object identifiers) acquired by the information acquisition device worn by the protagonist. The video in FIG. 11 is captured by the CCD camera of that device, the audio by its microphone, and the object identifiers by its infrared sensor (IR tracker). The object identifiers in FIG. 11 indicate the objects (people or exhibits) that the protagonist faced. In FIG. 11, the video, audio, and object identifiers are stored in order of acquisition time and are synchronized with one another. FIG. 11 also stores, as header information, the object identifier “35”, which identifies the holder of the information acquisition device (here, the protagonist). For example, the CCD camera, the microphone, and the IR tracker of the information acquisition device each transmit the holder's object identifier paired with the data they acquired (the video, the audio, or the detected object identifier, respectively). The raw content in FIG. 11 is, for example, the set of information transmitted from the information acquisition device.

Next, FIG. 12 shows the content acquired by the information acquisition device worn by the user identified by the object identifier “38”. FIG. 13 shows the content acquired by the information acquisition device installed on the exhibition panel identified by the object identifier “01”. The content in FIG. 13 contains no audio information.

FIG. 14 is an overview of the exhibition panel identified by the object identifier “01”. This panel has a display showing a technical explanation, with an infrared ID tag, an IR tracker, and a CCD camera mounted above the display. The infrared ID tag emits an infrared signal on which the object identifier “01” identifying the panel is superimposed. The IR tracker acquires the object identifiers emitted by the infrared ID tags of the information acquisition devices worn by users looking at the panel, and the CCD camera captures video of the users and others looking at the panel.
The content acquired by each of the above information acquisition devices is obtained by means not shown (communication, broadcasting, a recording medium, etc.), and the information processing apparatus stores it in the content storage unit 101.
In this situation, when a predetermined time arrives or the user issues an instruction, the utterance location detection unit 102 of the information processing apparatus acquires the content of the protagonist (object identifier “35”), that is, the content in FIG. 11.

Next, the utterance location detection unit 102 detects utterance locations in the content of FIG. 11 by the operation described above; that is, it acquires the sections “t = 0” to “t = 80”, “t = 350” to “t = 720”, and so on. The unit then acquires the object identifier “38” of the other person paired with the audio from “t = 0” to “t = 80”, and determines whether the object identified by “38” is speaking, in its own content, in a section nearly continuous with the section from “t = 0” to “t = 80”. Here, the audio corresponding to object identifier “38” is judged to be voiced in the section from “t = 75” to “t = 200”, which is nearly continuous with the section from “t = 0” to “t = 80”. Next, the unit determines that the audio corresponding to object identifier “35” is not voiced in any section nearly continuous with the section from “t = 75” to “t = 200”. As a result, the utterance location detection unit 102 acquires the section from “t = 0” (the start point) to “t = 200” (the end point). Because two people are conversing in this section, the scene type determination unit 106 determines the scene type to be “dialogue”.
Next, for the section from “t = 350” to “t = 720” detected in the content of FIG. 11, the utterance location detection unit 102 acquires the corresponding object identifier “40” from the same content.

Then, from the content corresponding to object identifier “40” (not shown), the utterance location detection unit 102 detects that there is no voiced audio nearly continuous with the section from “t = 350” to “t = 720”. The unit therefore acquires the section with start point “t = 350” and end point “t = 720”, and the scene type determination unit 106 determines the scene type to be “utterance”.
In the same way, the utterance location detection unit 102 detects, based on the content of FIG. 11, the “dialogue” and “utterance” locations involving the protagonist.
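One way to read the dialogue/utterance distinction worked through above is as interval chaining: start from one of the wearer's voiced sections and keep extending it while the other speaker (and then the wearer again) takes a turn that begins nearly where the previous one ended. The following is a minimal Python sketch under stated assumptions: each speaker's voiced sections are already available as `(start, end)` pairs, "nearly continuous" is taken to mean a turn starting within a hypothetical tolerance `gap` of the current end (overlap allowed), and the function names are illustrative rather than taken from the source.

```python
from typing import List, Optional, Tuple

Interval = Tuple[int, int]  # (start, end) in the time unit t


def _next_turn(intervals: List[Interval], end: int, gap: int) -> Optional[Interval]:
    """Return the first interval that starts within `gap` of `end` and extends past it."""
    for start, stop in intervals:
        if end - gap <= start <= end + gap and stop > end:
            return (start, stop)
    return None


def detect_speech_location(own: List[Interval], other: List[Interval],
                           index: int, gap: int = 10) -> Tuple[int, int, str]:
    """Grow own[index] by alternating turns; two or more turns make a dialogue."""
    start, end = own[index]
    turns = 1
    tracks = (other, own)  # after the wearer speaks, the other speaker answers first
    while True:
        turn = _next_turn(tracks[(turns - 1) % 2], end, gap)
        if turn is None:
            break
        end = turn[1]  # strictly increases, so the loop terminates
        turns += 1
    return start, end, "dialogue" if turns >= 2 else "utterance"
```

With the wearer's sections `[(0, 80), (350, 720)]` and speaker “38”'s section `[(75, 200)]`, the first location comes out as `(0, 200, "dialogue")`; with no nearby speech from speaker “40”, the second comes out as `(350, 720, "utterance")`, matching the example.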
Next, the facing location detection unit 103 detects facing locations in the content of FIG. 11 by the processing described above.

First, the facing location detection unit 103 examines the object identifiers in the content corresponding to the protagonist's object identifier “35”, and acquires every pair of an object identifier that was acquired nearly continuously for at least a predetermined time and the time information for that period. Here, the unit detects, for example, the section from “t = 3500” (start point) to “t = 3820” (end point): in FIG. 11, the object identifier “01” appears continuously, without gaps, for at least the predetermined time throughout this section. This shows that the protagonist was continuously viewing the object identified by “01” (an exhibition panel) from “t = 3500” to “t = 3820”. The facing location detection unit 103 then acquires the start point “t = 3500” and the end point “t = 3820”, and the scene type determination unit 106 determines the scene type to be “facing”. The facing location detection unit 103 likewise detects “facing” locations in the other sections based on the content of FIG. 11. Optionally, the facing location detection unit 103 may also check whether the protagonist's object identifier “35” appears nearly continuously among the object identifiers acquired during this section by the information acquisition device installed on exhibition panel “01” (FIG. 13), and use that check when deciding whether the section is a “facing” section.
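The search for object identifiers seen "nearly continuously for at least a predetermined time" can be sketched as run detection over the IR tracker's readings. This is a minimal Python sketch, assuming the readings are a time-sorted list of `(t, object_id)` pairs and that a run survives gaps of at most a hypothetical `max_gap`; both thresholds and the name `detect_facing` are illustrative, not from the source.

```python
from typing import Dict, List, Tuple


def detect_facing(readings: List[Tuple[int, str]], min_duration: int,
                  max_gap: int) -> List[Tuple[int, int, str]]:
    """Find (start, end, object_id) runs where one IR-tracker ID is seen nearly continuously.

    A run continues while consecutive sightings of the same ID are at most `max_gap` apart,
    and it is reported only if it lasts at least `min_duration`.
    """
    open_runs: Dict[str, List[int]] = {}  # object_id -> [start, last_seen]
    found: List[Tuple[int, int, str]] = []
    for t, oid in readings:
        run = open_runs.get(oid)
        if run is not None and t - run[1] <= max_gap:
            run[1] = t  # still the same run
            continue
        if run is not None and run[1] - run[0] >= min_duration:
            found.append((run[0], run[1], oid))  # close the previous, long-enough run
        open_runs[oid] = [t, t]  # start a new run for this ID
    for oid, (start, last) in open_runs.items():
        if last - start >= min_duration:
            found.append((start, last, oid))
    return sorted(found)
```

Keeping one open run per identifier means that several IDs seen at the same time (for example, a panel and a person standing beside it) are tracked independently.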

Through the processing above, the utterance location detection unit 102 and the facing location detection unit 103 detect, based on the content of FIG. 11, the “dialogue”, “utterance”, and “facing” locations involving the protagonist, and the scene type determination unit 106 determines their scene types. The detected locations and scene types are shown in the location management table of FIG. 15. This table holds, at least temporarily, information on the locations detected by the utterance location detection unit 102 and the facing location detection unit 103, together with the scene types determined by the scene type determination unit 106. It stores one or more records, each with the attributes “ID”, “object identifier”, “start point (t)”, “end point (t)”, and “scene type”. “ID” is information that identifies the record.

Next, the output unit 108 determines the panel layout for output based on the number of locations to be output, “218”, and the data size of each location (the temporal length of its section). Specifically, the output unit 108 holds many pieces of panel layout information, shown in FIG. 16. The panel layout information may be window information for the panels making up the screen (window attribute values such as position, size, and background color) or bitmap data representing each layout in FIG. 16; its data structure does not matter.

The output unit 108 also holds the panel layout determination information shown in FIG. 17, which is used to decide the panel layout. This information holds one or more records with the attributes “ID”, “panel layout identifier”, “number of locations”, and “maximum data length (t)”. The “panel layout identifier” identifies a piece of panel layout information; for example, the value “a” selects the layout in FIG. 16(a). The “number of locations” is a condition on the number of locations to be output, and the “maximum data length (t)” is a condition on the largest data length (in units of t, seconds) among the locations to be output.

In this case, the output unit 108 refers to the panel layout determination information in FIG. 17 based on the location management table in FIG. 15, and selects panel layout information. Here, the output unit 108 acquires the number of locations, “218”, from the location management table in FIG. 15, and the maximum data length over all locations in that table, “320”. The maximum data length “320” is the value of “end point − start point” for the record “ID = 15” in FIG. 15.
The output unit 108 then acquires the panel layout identifier “a”, which satisfies both the number of locations “218” and the maximum data length “320”, and selects panel layout information (a) from FIG. 16.

Next, the still image output means 1081 acquires one still image from the video in the content of each location in the location management table of FIG. 15. Here, it is assumed that the still image output means 1081 acquires, for example, the still image at roughly the middle of the video corresponding to each location; any algorithm may be used to choose the still image. FIG. 18 shows examples of still images acquired one per location: FIG. 18(a) corresponds to the location “ID = 1” in FIG. 15, FIG. 18(b) to “ID = 2”, and FIG. 18(c) to “ID = 15”.
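Picking "roughly the middle" still image can be sketched as a nearest-timestamp search around the midpoint of the section. This is a minimal Python sketch, assuming the video exposes a sorted list of frame timestamps in the same time unit as the section boundaries; the function name is hypothetical, and any other selection rule would serve, as the text notes.

```python
import bisect
from typing import Sequence


def middle_frame_index(timestamps: Sequence[float], start: float, end: float) -> int:
    """Return the index of the frame closest to the midpoint of [start, end].

    `timestamps` must be sorted in ascending order; ties go to the earlier frame.
    """
    if not timestamps:
        raise ValueError("video has no frames")
    mid = (start + end) / 2
    i = bisect.bisect_left(timestamps, mid)  # first frame at or after the midpoint
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - mid < mid - timestamps[i - 1] else i - 1
```

Binary search keeps the lookup at O(log n) per location, which matters little for one image but adds up when every one of the locations is processed repeatedly.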

Next, the balloon output means 1082 constructs the speech balloon to be added to each still image. An example of this processing follows. First, the balloon output means 1082 holds the balloon management table shown in FIG. 19, which stores one or more records with the attributes “ID”, “balloon shape”, and “voice length (t)”. The “balloon shape” attribute indicates the shape of the balloon; its value may be a bitmap, graphical data, or the like. The “voice length (t)” indicates the length of the voice, here expressed as a time (t). In other words, the shape of the balloon changes with how long the person speaks.

First, the position detection unit 105 analyzes the acquired still image and the still images before and after it (that is, the video), and detects a position at least near the object. Objects normally emit infrared light, and the position detection unit 105 obtains the position (coordinates) of the infrared emitter by image processing. Here, for example, it is assumed that, based on the still image of FIG. 18(a) and the still images before and after it, the position detection unit 105 acquires the position coordinates (250, 323) of the displayed object.
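The text leaves the image processing unspecified. One simple stand-in, assumed here rather than taken from the source, is to treat the infrared emitter as the brightest pixels in an IR-sensitive image, threshold them, and take their centroid pooled over the still image and its neighboring frames. The threshold of 200 and the function name are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[Sequence[int]]  # grayscale intensities, indexed frame[y][x]


def detect_ir_position(frames: List[Frame], threshold: int = 200) -> Optional[Tuple[int, int]]:
    """Return the centroid (x, y) of pixels at or above `threshold`, pooled over frames."""
    xs: List[int] = []
    ys: List[int] = []
    for frame in frames:
        for y, row in enumerate(frame):
            for x, value in enumerate(row):
                if value >= threshold:
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None  # no bright spot: the emitter is not visible in these frames
    return round(sum(xs) / len(xs)), round(sum(ys) / len(ys))
```

Pooling several frames smooths out a blinking or momentarily occluded emitter, which is one plausible reason for looking at the frames before and after the chosen still image.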

Next, the balloon output means 1082 acquires the voice length corresponding to the acquired still image, “200 − 0 = 200”, and selects the balloon type corresponding to the voice length (t) “200”, namely the balloon “ID = 2” in FIG. 19.
Next, the balloon output means 1082 generates a character string whose length depends on the voice length (t) “200”. Here, it is assumed that a string of 200/4 = 50 characters, “#ロ・・XY・・・Z・・・・”, is generated.
Next, the balloon output means 1082 generates a balloon image from the balloon shape “ID = 2” in FIG. 19 and the 50-character string “#ロ・・XY・・・Z・・・・”.
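The balloon selection and the placeholder text described above can be sketched as a threshold lookup plus a length rule. This is a minimal Python sketch; the contents of FIG. 19 are not given, so the table rows and shape names are hypothetical, and only the 200 → 50 characters ratio (4 time units per character) comes from the example. The function names are illustrative.

```python
from typing import Tuple

# Hypothetical balloon management table (FIG. 19):
# (ID, balloon shape, upper bound of the voice length t).
BALLOON_TABLE = (
    (1, "small-oval", 100),
    (2, "medium-oval", 300),
    (3, "large-burst", float("inf")),
)
T_PER_CHAR = 4  # the example maps a voice length of 200 to 200 / 4 = 50 characters


def select_balloon(voice_length: float) -> Tuple[int, str]:
    """Return (ID, shape) of the first row whose bound covers `voice_length`."""
    for balloon_id, shape, max_length in BALLOON_TABLE:
        if voice_length <= max_length:
            return balloon_id, shape
    raise LookupError("voice length not covered by the table")


def placeholder_text(voice_length: int, alphabet: str = "#ロXYZ・") -> str:
    """Build a dummy string whose length is proportional to the voice length."""
    n = voice_length // T_PER_CHAR
    return "".join(alphabet[i % len(alphabet)] for i in range(n))
```

The string content is deliberately meaningless: as in the example, only its length conveys how long the person spoke.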

Next, the balloon output means 1082 sets the display position of the balloon to the position coordinates (250, 323) of the displayed object. It is preferable that the balloon output means 1082 place the body of the balloon (the space containing the character string) on the emptier side of the screen relative to these coordinates.
Next, the balloon output means 1082 determines whether the protagonist's voice is present in the location containing the still image and, if so (that is, if the protagonist is speaking), acquires its voice length.

Because the protagonist, who wears the camera, does not appear in the still image, the balloon output means 1082 sets the object's position coordinates to (0, 500), outside the panel. It then generates a character string based on that voice length and generates a balloon image in the same way as above.
Likewise, the balloon output means 1082 and the position detection unit 105 perform the same processing to generate balloon images for the still images in FIGS. 18(b) and 18(c).
Each still image is then combined with its balloon image to generate the images shown in FIGS. 20(a), 20(b), and 20(c).
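Balloon placement as described above has two parts: where the tail points (the object, or a point outside the panel for an unseen speaker) and which side the body goes on. This is a minimal Python sketch; "the emptier side" is approximated here, as an assumption, by the half of the panel opposite the anchor, and the panel size used in the usage note is hypothetical. Only the off-panel point (0, 500) comes from the example.

```python
from typing import Tuple

Point = Tuple[int, int]
OFF_PANEL_ANCHOR: Point = (0, 500)  # value used in the example for a speaker not in the still


def balloon_anchor(object_pos: Point, speaker_visible: bool) -> Point:
    """Point the balloon at the object, or outside the panel for an unseen speaker."""
    return object_pos if speaker_visible else OFF_PANEL_ANCHOR


def balloon_body_side(anchor: Point, panel_w: int, panel_h: int) -> Tuple[str, str]:
    """Place the body away from the half of the panel that contains the anchor."""
    x, y = anchor
    horizontal = "left" if x > panel_w / 2 else "right"
    vertical = "above" if y > panel_h / 2 else "below"
    return horizontal, vertical
```

For the object at (250, 323) in an assumed 640 × 480 panel, the body goes to the right of and above the anchor.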

Next, the output unit 108 performs highlight processing on the still images (for example, those in FIGS. 20(a), 20(b), and 20(c)) as follows. The output unit 108 holds the highlight figure management table shown in FIG. 21, which holds one or more records with the attributes “ID”, “scene type”, and “highlight figure”. “Scene type” is the type of scene, and “highlight figure” is the highlight figure corresponding to that scene type. Here the highlight figures are a circle, a square, and a rectangle, but other figures may of course be used, and the data structure of the “highlight figure” attribute value does not matter.

Next, the position detection unit 105 acquires the position of the object in the still image to be output, and the output unit 108 acquires the highlight figure corresponding to the scene type of the still image. That is, the output unit 108 retrieves from the highlight figure management table of FIG. 21 the square highlight figure corresponding to the scene type “dialogue” of the still image in FIG. 20(a).

Next, the output unit 108 applies square highlight processing to the still image of FIG. 20(a), centered on the position of the object. The size of the square may be predetermined or may be changed dynamically according to the size of the panel. Highlight processing is, for example, processing that renders the inside of the highlight figure in color and the outside in monochrome. Alternatively, the inside may be displayed in full color and the outside as a 16-bit color image. In short, highlight processing is any processing that makes the inside of the highlight figure stand out.
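The color-inside, monochrome-outside variant of highlight processing can be sketched on an image stored as rows of RGB tuples. This is a minimal pure-Python sketch: the monochrome conversion uses the ITU-R BT.601 luma weights (one common choice, not specified by the source), the 2:1 aspect ratio of the rectangle is an assumption, and `half` stands for the figure's half-size, which the text says may be fixed or derived from the panel size.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Image = List[List[RGB]]  # image[y][x]


def to_gray(pixel: RGB) -> RGB:
    """ITU-R BT.601 luma, replicated into three channels."""
    r, g, b = pixel
    y = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (y, y, y)


def inside(shape: str, dx: int, dy: int, half: int) -> bool:
    """Is the offset (dx, dy) from the center inside the highlight figure?"""
    if shape == "circle":
        return dx * dx + dy * dy <= half * half
    if shape == "square":
        return abs(dx) <= half and abs(dy) <= half
    if shape == "rectangle":  # assumed 2:1 width-to-height ratio
        return abs(dx) <= 2 * half and abs(dy) <= half
    raise ValueError(f"unsupported highlight figure: {shape}")


def highlight(image: Image, center: Tuple[int, int], shape: str, half: int) -> Image:
    """Keep color inside the figure centered on `center`; convert everything else to gray."""
    cx, cy = center
    return [
        [px if inside(shape, x - cx, y - cy, half) else to_gray(px) for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

The function returns a new image rather than modifying its input, so the original still image remains available for the balloon composition step.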

Similarly, the output unit 108 applies highlight processing to the still images in FIGS. 20(b) and 20(c), centered on the object position and using the figure corresponding to each scene type. As a result, the output unit 108 obtains the still images in FIGS. 22(a), 22(b), and 22(c).

The output unit 108 then displays each still image in its corresponding panel (window). Preferably, the output unit 108 uses an algorithm that places a still image in a larger panel when its corresponding section is longer.

The output unit 108 sequentially extracts and generates, by the processing above, the still images corresponding to the 218 locations shown in FIG. 15, and places them in the panels of the panel layout information in FIG. 16(a). That is, after displaying a still image in a panel, the output unit 108 keeps it displayed for a predetermined time (which may be random) and then overwrites it with another still image, so the user sees the still images flip by one after another. In other words, the still image extraction unit 104 repeatedly extracts from the video one or more still images constituting the utterance and/or facing locations, and the output unit 108 outputs the still images extracted by the still image extraction unit 104 to the one or more windows while switching between them.
FIGS. 23 and 24 show the overall display in this case. In FIG. 23 there are gutters (the blank spaces typical of comics) between the panels; FIG. 24 shows a variant without them.
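The difference between FIG. 23 and FIG. 24 can be illustrated with a uniform grid of panels, with or without a gutter, together with a check that the windows do not overlap. The layouts in FIG. 16 are not uniform grids, so this is a simplified, hypothetical stand-in; the page size and the function names are assumptions.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height)


def grid_panels(width: int, height: int, rows: int, cols: int, gutter: int = 0) -> List[Rect]:
    """Split a page into rows x cols panels separated by `gutter` pixels (0 = no gutters)."""
    pw = (width - gutter * (cols - 1)) // cols
    ph = (height - gutter * (rows - 1)) // rows
    return [(c * (pw + gutter), r * (ph + gutter), pw, ph)
            for r in range(rows) for c in range(cols)]


def overlap_area(a: Rect, b: Rect) -> int:
    """Area shared by two rectangles (0 if they merely touch or are apart)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)
```

With a 10-pixel gutter on an assumed 640 × 480 page split 2 × 2, each panel is 315 × 235 and every pair of panels has zero overlap; with no gutter, adjacent panels share only an edge, which still counts as non-overlapping.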
As described above, the present embodiment can extract still images from a moving image and display them in a comic-like style, and can provide an effective display that switches the still images within the panels making up the comic-like layout.
Furthermore, when the user visits, for example, technology or other exhibitions, museums, or art galleries, or goes sightseeing, a suitable electronic album can be composed automatically from the acquired content (including video and audio).

The processing in the present embodiment may also be implemented in software, which may be distributed by download or recorded on a recording medium such as a CD-ROM for distribution; this also applies to the other embodiments in this specification. The software that implements the information processing apparatus of the present embodiment is the following program: a program that causes a computer to execute an utterance location detection step of detecting, based on the audio of content having video and audio, an utterance location, which is a location where someone is speaking; a still image extraction step of extracting, from the video, one or more still images constituting the utterance location detected in the utterance location detection step; and an output step of outputting the one or more still images extracted in the still image extraction step to one or more substantially non-overlapping windows.

Alternatively, the program causes a computer to execute a facing location detection step of detecting a location in the stored content where two or more objects face each other; a still image extraction step of extracting one or more still images from the video constituting the facing location detected in the facing location detection step; and an output step of outputting the one or more still images extracted in the still image extraction step to one or more substantially non-overlapping windows.

It is preferable that, in the still image extraction step of the program, one or more still images constituting an utterance location and/or a facing location be repeatedly extracted from the video, and that, in the output step, the still images extracted in the still image extraction step be output to the one or more windows while being switched.

It is also preferable that the output step of the program include a still image output substep of outputting the still image extracted in the still image extraction step, and a balloon output substep of outputting a speech balloon superimposed on the still image.
It is also preferable that, in the balloon output substep, the audio corresponding to the video containing the still image be analyzed and that balloons of two or more shapes be output distinctly according to the analysis result.

It is also preferable that, in the balloon output substep, the audio corresponding to the video containing the still image be analyzed to obtain the length of the utterance, and that a character string whose length corresponds to that utterance length be output in the balloon.

It is also preferable that the program further cause the computer to execute a position detection step of analyzing the video and detecting a position at least near an object, and that, in the balloon output substep, the balloon be output near the position detected in the position detection step.

It is also preferable that the program further cause the computer to execute a non-displayed user utterance detection step of analyzing the content constituting the utterance location or the facing location and detecting that the audio corresponding to the still image to be output is the voice of a user who does not appear in the still image, and that, in the balloon output substep, when the non-displayed user utterance detection step detects such a voice, the balloon be output so that it emerges from outside or from a corner of the window.

It is also preferable that the program further cause the computer to execute a position detection step of analyzing the video and detecting a position at least near an object, and that, in the output step, the still image be output with a different output mode for the region around the position detected in the position detection step than for the other regions.

It is also preferable, in the above program, that the video information has the video and an object identifier that identifies an object appearing in the video, that the audio information has the audio and an object identifier that identifies the speaker of the audio, and that the utterance location detection step detects an utterance location that satisfies two conditions. First, the audio volume there is at or above a predetermined level. Second, it is a dialogue location, that is, a location having audio paired with one object identifier followed substantially continuously by audio paired with another object identifier.
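The dialogue criterion in the preceding paragraph can be sketched as follows. Several details are hypothetical because the source gives no values: the segment representation, the `min_level` volume threshold, and the `max_gap` tolerance that stands in for "substantially continuous". A run of loud, nearly contiguous segments is kept as a dialogue location only when it contains at least two different object identifiers.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    speaker: str   # object identifier paired with this audio
    start: float   # seconds
    end: float     # seconds
    level: float   # e.g. mean volume of the segment


def detect_dialogues(segments: List[Segment], min_level: float = 0.2,
                     max_gap: float = 1.0) -> List[Tuple[float, float, List[str]]]:
    # keep only segments loud enough to count as utterances
    loud = sorted((s for s in segments if s.level >= min_level),
                  key=lambda s: s.start)
    dialogues = []
    i = 0
    while i < len(loud):
        j = i
        # extend the run while the next segment follows almost immediately
        while j + 1 < len(loud) and loud[j + 1].start - loud[j].end <= max_gap:
            j += 1
        run = loud[i:j + 1]
        speakers: List[str] = []
        for s in run:
            if s.speaker not in speakers:
                speakers.append(s.speaker)
        # two or more different speakers in one run -> a dialogue location
        if len(speakers) >= 2:
            dialogues.append((run[0].start, run[-1].end, speakers))
        i = j + 1
    return dialogues
```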

It is also preferable that the above program further causes the computer to execute a scene type determination step of analyzing the content constituting the utterance location that is the dialogue location and determining the type of scene, and that, in the output step, the shape of the region around the position detected in the position detection step varies according to the scene type.
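The source does not list the scene types or the shapes associated with them, so the sketch below is purely hypothetical. A toy rule classifies a dialogue location by the number of distinct speakers, and a lookup table maps each scene type to a highlight-region shape. Both the labels and the shapes are invented for illustration.

```python
from typing import List

# Hypothetical scene labels and shapes; the source does not specify them.
SHAPE_FOR_SCENE = {
    "two_person_dialogue": "ellipse",
    "group_discussion": "rectangle",
    "other": "circle",
}


def classify_scene(speaker_turns: List[str]) -> str:
    """Toy rule: classify by the number of distinct speakers in the dialogue."""
    distinct = len(set(speaker_turns))
    if distinct == 2:
        return "two_person_dialogue"
    if distinct >= 3:
        return "group_discussion"
    return "other"


def highlight_shape(speaker_turns: List[str]) -> str:
    # shape of the region drawn around the detected position
    return SHAPE_FOR_SCENE[classify_scene(speaker_turns)]
```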

In each of the above embodiments, each process (each function) may be realized either by centralized processing in a single apparatus (system) or by distributed processing across a plurality of apparatuses.

FIG. 25 shows the external appearance of a computer that executes the programs described in this specification to realize the information processing apparatuses of the various embodiments described above. The above embodiments can be realized by computer hardware and a computer program executed on it. FIG. 25 is an overview of the computer system 250, and FIG. 26 is a block diagram of the system 250.

In FIG. 25, the computer system 250 includes a computer 251 with an FD (Flexible Disk) drive and a CD-ROM (Compact Disc Read-Only Memory) drive, a keyboard 252, a mouse 253, a monitor 254, and a speaker 255.

In FIG. 26, the computer 251 includes, in addition to the FD drive 2511 and the CD-ROM drive 2512, the following components: a CPU (Central Processing Unit) 2513; a bus 2514 connected to the CPU 2513, the CD-ROM drive 2512, and the FD drive 2511; a ROM (Read-Only Memory) 2515 for storing programs such as a boot-up program; a RAM (Random Access Memory) 2516, connected to the CPU 2513, for temporarily storing application program instructions and providing temporary storage space; and a hard disk 2517 for storing application programs, system programs, and data. Although not shown, the computer 251 may further include a network card that provides a connection to a LAN.

A program that causes the computer system 250 to execute the functions of the information processing apparatus of the above embodiments may be stored on a CD-ROM 2601 or an FD 2602, inserted into the CD-ROM drive 2512 or the FD drive 2511, and then transferred to the hard disk 2517. Alternatively, the program may be transmitted to the computer 251 via a network (not shown) and stored on the hard disk 2517. The program is loaded into the RAM 2516 at execution time. It may also be loaded directly from the CD-ROM 2601, the FD 2602, or the network.

The program does not necessarily have to include an operating system (OS), third-party programs, or the like for causing the computer 251 to execute the functions of the information processing apparatus of the above embodiments. The program need only include the instructions that call appropriate functions (modules) in a controlled manner so as to obtain the desired results. How the computer system 250 operates is well known, so a detailed description is omitted.

In the above program, steps such as the step of outputting information do not include processing performed by hardware. For example, processing performed by a monitor in the output step (processing that can be performed only by hardware) is excluded.
The computer that executes the above program may be a single computer or a plurality of computers; that is, either centralized processing or distributed processing may be performed.
Needless to say, the present invention is not limited to the above embodiments; various modifications are possible, and they are also included within the scope of the present invention.

As described above, the information processing apparatus according to the present invention can appropriately extract a predetermined still image from a moving image and output it, and is therefore useful as an electronic album apparatus or the like.

Brief Description of Drawings

FIG. 1: Block diagram of the information processing apparatus in the embodiment
FIG. 2: Diagram showing an example of the information acquisition device
FIG. 3: Flowchart explaining the operation of the information processing apparatus
FIG. 4: Flowchart explaining the process of detecting an utterance location
FIG. 5: Flowchart explaining the process of detecting an utterance location
FIG. 6: Flowchart explaining the process of detecting an opposite location
FIG. 7: Flowchart explaining the frame layout determination process
FIG. 8: Flowchart explaining the balloon process
FIG. 9: Flowchart explaining the output image generation process
FIG. 10: Diagram explaining a method of determining whether two voices are substantially continuous
FIG. 11: Diagram showing an example of content
FIG. 12: Diagram showing an example of content
FIG. 13: Diagram showing an example of content
FIG. 14: Overview of the exhibition panel
FIG. 15: Diagram showing the location management table
FIG. 16: Diagram showing the frame layout information
FIG. 17: Diagram showing the frame layout determination information
FIG. 18: Diagram showing an example of a still image acquired by the still image output means
FIG. 19: Diagram showing the balloon management table
FIG. 20: Diagram showing an example of an image in which a still image and a balloon image are combined
FIG. 21: Diagram showing the highlight graphic management table
FIG. 22: Diagram showing an example of a still image obtained by the output unit 108
FIG. 23: Diagram showing an example of the overall display image
FIG. 24: Diagram showing an example of the overall display image
FIG. 25: Diagram showing the external appearance of a computer that implements the information processing apparatus
FIG. 26: Block diagram of the computer system

Explanation of symbols

101 Content storage unit
102 Utterance location detection unit
103 Opposite location detection unit
104 Still image extraction unit
105 Position detection unit
106 Scene type determination unit
107 Non-display user utterance detection unit
108 Output unit
1081 Still image output means
1082 Balloon output means

Claims

An information processing apparatus comprising:
a content storage unit that stores one or more pieces of content, each including video information having video and audio information having audio;
an utterance location detection unit that detects, based on the audio, an utterance location, that is, a location where speech occurs;
a still image extraction unit that extracts, from the video, one or more still images constituting the utterance location detected by the utterance location detection unit; and
an output unit that outputs the one or more still images extracted by the still image extraction unit to one or more windows that substantially do not overlap.

An information processing apparatus comprising:
a content storage unit that stores one or more pieces of content, each including video information having video and audio information having audio;
an opposite location detection unit that detects an opposite location, that is, a location in the content where two or more objects face each other;
a still image extraction unit that extracts, from the video, one or more still images constituting the opposite location detected by the opposite location detection unit; and
an output unit that outputs the one or more still images extracted by the still image extraction unit to one or more windows that substantially do not overlap.

The information processing apparatus according to claim 1, wherein
the still image extraction unit repeatedly extracts, from the video, one or more still images constituting the utterance location and/or the opposite location, and
the output unit outputs the still images extracted by the still image extraction unit to the one or more windows while switching between them.

The information processing apparatus according to claim 1, wherein the output unit comprises:
still image output means for outputting the still image extracted by the still image extraction unit; and
balloon output means for outputting a speech balloon on the still image.

The information processing apparatus according to claim 4, further comprising a position detection unit that analyzes the video and detects a position at least in the vicinity of an object,
wherein the balloon output means outputs the speech balloon around the position detected by the position detection unit.

The information processing apparatus, further comprising a non-display user utterance detection unit that analyzes the content constituting the utterance location or the opposite location and detects that the audio corresponding to the still image to be output is the voice of a user who does not appear in the still image,
wherein, when the non-display user utterance detection unit detects the voice of a user who does not appear in the still image, the balloon output means outputs the speech balloon so that it appears from outside or from a corner of the window.

The output unit is
The information processing apparatus according to claim 5, wherein the output unit outputs the still image with the area around the position detected by the position detection unit in color and the other area in monochrome.

A program for causing a computer to execute:
an utterance location detection step of detecting, based on the audio of content having video and audio, an utterance location, that is, a location where speech occurs;
a still image extraction step of extracting, from the video, one or more still images constituting the utterance location detected in the utterance location detection step; and
an output step of outputting the one or more still images extracted in the still image extraction step to one or more windows that substantially do not overlap.

A program for causing a computer to execute:
an opposite location detection step of detecting an opposite location, that is, a location in stored content where two or more objects face each other;
a still image extraction step of extracting, from the video, one or more still images constituting the opposite location detected in the opposite location detection step; and
an output step of outputting the one or more still images extracted in the still image extraction step to one or more windows that substantially do not overlap.