JP2022019341A - Information processing apparatus, information processing method, and program

Info

Publication number: JP2022019341A
Application number: JP2020123121A
Authority: JP
Inventors: 英人榊間; Hideto Sakakima
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2020-07-17
Filing date: 2020-07-17
Publication date: 2022-01-27

Abstract

PROBLEM TO BE SOLVED: To generate three-dimensional shape data of an object at a time different from the shooting time based on a shot image obtained by shooting the object.
SOLUTION: An information processing apparatus generates shape data representing the three-dimensional shape of an object at a predetermined shooting time, based on a plurality of captured images obtained by imaging the object from different directions with a plurality of imaging devices. The information processing apparatus also acquires first posture information representing the posture of the object at the shooting time and second posture information representing the posture of the object at a specific time different from the shooting time. The information processing apparatus then generates shape data representing the three-dimensional shape of the object at the specific time, based on the first posture information, the second posture information, and the shape data corresponding to the shooting time.
[Selection diagram] Fig. 9

Description

The present invention relates to a technique for generating a three-dimensional model of an object using a plurality of captured images.

There is a technique in which a plurality of imaging devices installed at different positions capture images synchronously from multiple viewpoints, and the resulting multi-viewpoint images are used to generate a virtual viewpoint image representing the scene as viewed from an arbitrary viewpoint. With such a technique, highlight scenes of soccer or basketball games, concerts, and the like can be viewed from various angles, giving the user a greater sense of presence than ordinary images.

One method of generating a virtual viewpoint image is to generate three-dimensional shape data of an object in the imaging area from images captured by a plurality of imaging devices, and to render the virtual viewpoint image using that three-dimensional shape data. Patent Document 1 describes adjusting a preset, adjustable three-dimensional object template model based on three-dimensional object information obtained from a plurality of camera images, and applying a projective transformation to the adjusted model to generate a virtual viewpoint image.

Patent Document 1: Japanese Unexamined Patent Publication No. 2016-126425

An imaging device generates captured images at a predetermined frame rate, but three-dimensional shape data of an object may be required at a time different from the times corresponding to the frames of the captured images. For example, when a virtual viewpoint image is displayed on a device capable of displaying images at a frame rate higher than that of the captured images, using a virtual viewpoint image with the higher frame rate enables smooth video playback. Likewise, playing a high-frame-rate virtual viewpoint image in slow motion yields smooth slow-motion video. Generating a virtual viewpoint image at a frame rate higher than that of the captured images requires three-dimensional shape data at times that do not correspond to any captured frame. With conventional methods, however, three-dimensional shape data of the object cannot be obtained for a time at which no image was captured.
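As an illustration of the problem described above, the following minimal sketch lists which output frame times coincide with captured frames and which fall between them and would therefore need interpolated shape data. The function name and the example frame rates (60 fps capture, 120 fps output) are assumptions made for illustration only; they do not appear in this publication.

```python
from fractions import Fraction


def split_output_times(capture_fps: int, output_fps: int, num_output_frames: int):
    """Split output frame times into those matching a captured frame and the rest.

    Exact fractions are used so that the comparison is not affected by
    floating-point rounding.
    """
    captured, interpolated = [], []
    for k in range(num_output_frames):
        t = Fraction(k, output_fps)  # time of the k-th output frame, in seconds
        # t coincides with a captured frame when t * capture_fps is an integer.
        if (t * capture_fps).denominator == 1:
            captured.append(t)
        else:
            interpolated.append(t)
    return captured, interpolated


# Hypothetical example: 60 fps capture shown on a 120 fps display.
captured_times, interpolated_times = split_output_times(60, 120, 4)
```

In this hypothetical example, every second output frame falls at a time for which no image was captured.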

The present invention has been made in view of the above problem, and its object is to generate three-dimensional shape data of an object at a time different from the shooting time, based on captured images obtained by imaging the object.

To solve the above problem, an information processing apparatus according to the present invention has, for example, the following configuration. That is, the information processing apparatus includes: first generation means for generating shape data representing the three-dimensional shape of an object at a predetermined shooting time, based on a plurality of captured images obtained by imaging the object from different directions with a plurality of imaging devices at the predetermined shooting time; acquisition means for acquiring first posture information representing the posture of the object at the predetermined shooting time and second posture information representing the posture of the object at a specific time different from the predetermined shooting time; and second generation means for generating shape data representing the three-dimensional shape of the object at the specific time, based on the first posture information and the second posture information acquired by the acquisition means and the shape data generated by the generation means.

According to the present invention, three-dimensional shape data of an object at a time different from the shooting time can be generated based on captured images obtained by imaging the object.

A diagram showing a configuration example of the image generation system.
A diagram showing a configuration example of the image generation device.
A diagram for explaining the three-dimensional model and the posture information.
A diagram showing the temporal relationship among captured images, three-dimensional models, and posture information.
A flowchart for explaining the process by which the image generation device generates an interpolated three-dimensional model.
A diagram for explaining a method of interpolating a bone model.
A flowchart for explaining the process of generating an interpolated three-dimensional model using interpolated posture information.
A diagram for explaining a method of interpolating a three-dimensional model.
A diagram showing the temporal relationship between captured images and interpolated three-dimensional models.
A flowchart for explaining the operation of the image generation device.
A diagram showing the temporal relationship between captured images and interpolated three-dimensional models.
A diagram for explaining a method of interpolating a bone model.

[System configuration]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 shows a configuration example of an image generation system 100. The image generation system 100 generates a virtual viewpoint image representing the view from a virtual viewpoint, based on a plurality of images (multi-viewpoint images) captured by a plurality of imaging devices and on a virtual viewpoint position and line-of-sight direction. The virtual viewpoint image in the present embodiment is also called a free viewpoint video, but it is not limited to an image corresponding to a viewpoint freely (arbitrarily) specified by the user; for example, an image corresponding to a viewpoint that the user selects from a plurality of candidates is also a virtual viewpoint image. The present embodiment mainly describes the case where the virtual viewpoint is specified by a user operation, but the virtual viewpoint may instead be specified automatically, for example based on the result of image analysis. The image generation system 100 generates a virtual viewpoint moving image, which is played back by updating, at a predetermined frame update interval, the still virtual viewpoint images that serve as the frames of the moving image. In the following description, unless otherwise noted, the term "image" covers both moving images and still images.

The present embodiment mainly describes an example in which the image generation system 100 provides virtual viewpoint content that includes a virtual viewpoint image and virtual viewpoint audio. However, the virtual viewpoint content need not include audio. The audio included in the virtual viewpoint content may also be the audio collected by the microphone closest to the virtual viewpoint. To simplify the explanation, the description of audio is partly omitted in the present embodiment, but images and audio are basically processed together.

The image generation system 100 includes sensor systems 110a to 110z, an image generation device 122, a controller 123, a switching hub 121, an end user terminal 126, and a time server 127.

The sensor system 110a includes a microphone 111a, a camera 112a, a pan head 113a, an external sensor 114a, and a camera adapter 120a. The sensor system 110a is not limited to this configuration and need only have at least one camera 112a or microphone 111a. For example, the sensor system 110a may consist of one camera adapter 120a and a plurality of cameras 112a, or of one camera 112a and a plurality of camera adapters 120a. In other words, the cameras 112 and the camera adapters 120 in the image generation system 100 correspond N to M (where N and M are both integers of 1 or more). The sensor system 110a may also include devices other than the microphone 111a, the camera 112a, the pan head 113a, and the camera adapter 120a. The camera 112a and the camera adapter 120a may also be integrated into a single unit.

The audio collected by the microphone 111a and the images captured by the camera 112a are transmitted to the switching hub 121 via the camera adapter 120a. Although the present embodiment shows an example in which the camera 112a and the camera adapter 120a are separate, they may be integrated in the same housing. In that case, the microphone 111a may be built into the integrated camera 112a or connected externally to the camera 112a.

In the present embodiment, the sensor systems 110b to 110z have the same configuration as the sensor system 110a. However, the sensor systems 110 may each have a different configuration. In the present embodiment, the 26 sets of systems from the sensor system 110a to the sensor system 110z are referred to as the sensor systems 110 when they need not be distinguished. Likewise, the devices in each sensor system 110 are referred to as the microphone 111, the camera 112, the pan head 113, the external sensor 114, and the camera adapter 120 when they need not be distinguished. Although FIG. 1 shows an example with 26 sets of sensor systems, the number of sensor systems 110 included in the image generation system 100 is not limited to this.

Each of the sensor systems 110 has one camera 112. That is, the image generation system 100 has the cameras 112 as a plurality of imaging devices for imaging a subject from a plurality of directions. The imaging area captured by the cameras 112 is, for example, a stadium where a sport such as soccer or karate is played, or a stage where a concert or performance takes place. The cameras 112 are installed at different positions so as to surround the imaging area, and capture images in synchronization. The cameras 112 need not be installed around the entire circumference of the imaging area; depending on restrictions such as the installation site, they may be installed around only part of it. The cameras 112 may also include imaging devices with different functions, such as telephoto cameras and wide-angle cameras.

The sensor systems 110 are connected to the switching hub 121 and form a star network in which data is exchanged between the sensor systems 110 via the switching hub 121. Each sensor system 110 is also connected to the image generation device 122 via the switching hub 121 and outputs multi-viewpoint images based on imaging by the cameras 112 to the image generation device 122.

The time server 127 has a function of distributing a time and a synchronization signal, and distributes them to the sensor systems 110 via the switching hub 121. On receiving the time and the synchronization signal, each camera adapter 120 genlocks its camera 112 based on them to synchronize the image frames. That is, the time server 127 synchronizes the capture timing of the cameras 112. The image generation system 100 can therefore generate a virtual viewpoint image from a plurality of captured images taken at the same timing, which suppresses the degradation in virtual viewpoint image quality caused by capture timing offsets. In the present embodiment the time server 127 manages the time synchronization of the cameras 112, but this is not a limitation; each camera 112 or camera adapter 120 may perform the time synchronization processing independently.

The controller 123 has a control station 124 and a virtual camera operation UI 125. The control station 124 is connected via a network to each device constituting the image generation system 100, and manages the operating state of each device, controls parameter settings, and so on. The network may be GbE (Gigabit Ethernet) or 10GbE, which are Ethernet (registered trademark) conforming to IEEE standards, or may be configured by combining an InfiniBand interconnect, industrial Ethernet, and the like. The network is not limited to these and may be of another type.

Specifically, the control station 124 performs various settings and controls for the image generation system 100. The control station 124 also transmits a three-dimensional model of the stadium or other venue being imaged to the image generation device 122. In addition, the control station 124 calibrates the cameras 112. In camera calibration, markers are placed on the field to be imaged and images are captured with the cameras 112; the position and orientation of each camera 112 in the world coordinate system, and its focal length, are calculated from the captured images. The calculated position, orientation, and focal length of each camera 112 are transmitted to the image generation device 122. The image generation device 122 uses the transmitted three-dimensional model and camera 112 information when generating a virtual viewpoint image.
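To make the role of the calibrated parameters concrete, the following minimal sketch projects a point in world coordinates onto a camera's image plane using a position, an orientation, and a focal length. It assumes an idealized pinhole camera with no lens distortion and with the principal point at the image origin; the function name and the representation of the orientation as a 3x3 world-to-camera rotation matrix are illustrative assumptions, not details given in this publication.

```python
def project_point(world_point, camera_position, rotation, focal_length):
    """Project a world-coordinate point with an idealized pinhole camera.

    rotation is a 3x3 world-to-camera rotation matrix given as three rows
    (an assumed representation of the camera orientation).
    Returns (u, v) image-plane coordinates, or None if the point is not in
    front of the camera.
    """
    # Express the point relative to the camera position.
    offset = [w - c for w, c in zip(world_point, camera_position)]
    # Rotate the offset into the camera coordinate system.
    x, y, z = (sum(r * o for r, o in zip(row, offset)) for row in rotation)
    if z <= 0:
        return None  # behind the camera or on its principal plane
    # Perspective division scaled by the focal length.
    return (focal_length * x / z, focal_length * y / z)


IDENTITY = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pixel = project_point((1.0, 2.0, 4.0), (0.0, 0.0, 0.0), IDENTITY, 2.0)
```

A projection of this kind is what allows a position in the imaging area to be related to pixels in each camera's captured image.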

The virtual camera operation UI 125 accepts a user operation for specifying the virtual viewpoint corresponding to the virtual viewpoint image to be generated, and transmits viewpoint information corresponding to the user operation to the image generation device 122, which generates the virtual viewpoint image. The viewpoint information used to generate a virtual viewpoint image indicates the position and orientation (line-of-sight direction) of the virtual viewpoint. Specifically, the viewpoint information has a parameter set that includes parameters representing the three-dimensional position of the virtual viewpoint and parameters representing its orientation in the pan, tilt, and roll directions. The viewpoint information has a plurality of such parameter sets, one for each of a plurality of time points. For example, the viewpoint information has a parameter set for each of the frames constituting a virtual viewpoint moving image, and indicates the position and orientation of the virtual viewpoint at each of a plurality of consecutive time points. The content of the viewpoint information is not limited to the above. For example, a parameter set may also include a parameter representing the size of the field of view (angle of view) of the virtual viewpoint or a parameter representing a time.
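One possible in-memory representation of the viewpoint information described above is sketched below. The class names, field names, and units are assumptions chosen for illustration; the publication specifies only which kinds of parameters a parameter set contains.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ViewpointParams:
    """One parameter set: the virtual viewpoint at a single time point."""
    position: Tuple[float, float, float]   # three-dimensional position
    pan: float                             # orientation, in degrees (assumed unit)
    tilt: float
    roll: float
    angle_of_view: Optional[float] = None  # optional field-of-view parameter
    time: Optional[float] = None           # optional time parameter, in seconds


@dataclass
class ViewpointInfo:
    """Viewpoint information: one parameter set per frame of the moving image."""
    frames: List[ViewpointParams] = field(default_factory=list)


def make_pan_sweep(num_frames: int, fps: float, start_pan: float, end_pan: float) -> ViewpointInfo:
    """Build hypothetical viewpoint information that pans at a fixed position."""
    info = ViewpointInfo()
    for k in range(num_frames):
        ratio = k / (num_frames - 1) if num_frames > 1 else 0.0
        info.frames.append(ViewpointParams(
            position=(0.0, 0.0, 10.0),
            pan=start_pan + (end_pan - start_pan) * ratio,
            tilt=0.0,
            roll=0.0,
            time=k / fps,
        ))
    return info


sweep = make_pan_sweep(3, 60.0, 0.0, 90.0)
```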

The image generation device 122 generates a virtual viewpoint image based on the multi-viewpoint images acquired from the sensor systems 110 and the viewpoint information acquired from the virtual camera operation UI 125. The virtual viewpoint image is generated, for example, as follows. First, from the multi-viewpoint images obtained by imaging from different directions with a plurality of imaging devices, a foreground image is acquired by extracting the foreground region corresponding to a predetermined object such as a person or a ball, and a background image is acquired by extracting the background region other than the foreground region. Next, a foreground model representing the three-dimensional shape of the predetermined object, and texture data for coloring the foreground model, are generated based on the foreground image; texture data for coloring a background model representing the three-dimensional shape of the background, such as a stadium, is generated based on the background image. The texture data is then mapped onto the foreground model and the background model, and rendering is performed according to the virtual viewpoint indicated by the viewpoint information to generate the virtual viewpoint image. However, the method of generating a virtual viewpoint image is not limited to this; various methods can be used, such as generating a virtual viewpoint image by projective transformation of the captured images without using a three-dimensional model.

The virtual viewpoint image generated by the image generation device 122 is transmitted to the end user terminal 126 and displayed on its display screen. Like the virtual camera operation UI 125, the end user terminal 126 may output viewpoint information corresponding to a user operation for specifying a virtual viewpoint to the image generation device 122. The user operating the end user terminal 126 can thus view images and listen to audio according to the specified viewpoint.

The image generation device 122 may compress and encode the virtual viewpoint image using a standard technique such as H.264 or HEVC, and then transmit the data to the end user terminal 126 using the MPEG-DASH protocol. Alternatively, the virtual viewpoint image may be transmitted to the end user terminal 126 uncompressed. For example, compression encoding may be performed when a smartphone or tablet is used as the end user terminal 126, and an uncompressed image may be transmitted when the end user terminal 126 is a display capable of displaying uncompressed images. That is, the image format can be switched according to the type of the end user terminal 126. The image transmission protocol is not limited to MPEG-DASH; for example, HLS (HTTP Live Streaming) or another transmission method may be used.

[Hardware configuration]
The hardware configuration of the image generation device 122, which is an example of an information processing apparatus included in the image generation system 100, will be described with reference to FIG. 2A. The other devices included in the image generation system 100 shown in FIG. 1 may have the same hardware configuration as the image generation device 122 described below. The image generation device 122 has a CPU 211, a ROM 212, a RAM 213, an auxiliary storage device 214, a display unit 215, an operation unit 216, a communication I/F 217, and a bus 218.

The CPU 211 controls the entire image generation device 122 using computer programs and data stored in the ROM 212 and the RAM 213, thereby realizing the functions of the image generation device 122 shown in FIG. 2B. The image generation device 122 may have one or more pieces of dedicated hardware separate from the CPU 211, and the dedicated hardware may execute at least part of the processing otherwise performed by the CPU 211. Examples of dedicated hardware include ASICs (application-specific integrated circuits), FPGAs (field-programmable gate arrays), and DSPs (digital signal processors). The ROM 212 stores programs and other data that do not need to be changed. The RAM 213 temporarily stores programs and data supplied from the auxiliary storage device 214, data supplied from outside via the communication I/F 217, and the like. The auxiliary storage device 214 consists of, for example, a hard disk drive, and stores various data such as image data and audio data.

The display unit 215 consists of, for example, a liquid crystal display or LEDs, and displays a GUI (graphical user interface) with which the user operates the image generation device 122. The operation unit 216 consists of, for example, a keyboard, mouse, joystick, or touch panel, and inputs various instructions to the CPU 211 in response to user operations. The CPU 211 operates as a display control unit that controls the display unit 215 and as an operation control unit that controls the operation unit 216. The communication I/F 217 is used for communication with devices external to the image generation device 122. For example, when the image generation device 122 is connected to an external device by wire, a communication cable is connected to the communication I/F 217. When the image generation device 122 has a function of communicating wirelessly with an external device, the communication I/F 217 includes an antenna. The bus 218 connects the units of the image generation device 122 and conveys information between them.

In the present embodiment, the display unit 215 and the operation unit 216 are located inside the image generation device 122, but at least one of them may exist as a separate device outside the image generation device 122.

[Functional configuration]
FIG. 2B is a diagram showing an example of the functional configuration of the image generation device 122. The data receiving unit 201 receives image data based on imaging by the cameras 112 via the switching hub 121. The image data received here may be the captured images taken by the cameras 112, or images obtained by extracting the region corresponding to a specific object from the captured images. In the present embodiment, the image data acquired by the data receiving unit 201 is captured moving images each composed of a plurality of frames. That is, the data receiving unit 201 acquires a plurality of moving images based on imaging by a plurality of imaging devices during a predetermined shooting period.

The model generation unit 202 uses the image data acquired by the data receiving unit 201 to generate, for each shooting time corresponding to a frame of the captured images, a three-dimensional model representing the three-dimensional shape of an object in the imaging area. Various methods exist for generating a three-dimensional model; the present embodiment uses the method known as Visual Hull, or the visual volume intersection method, which obtains a three-dimensional model by retaining, among the voxels in a three-dimensional space, those that fall within the subject region observed from the plurality of cameras 112. However, the method by which the model generation unit 202 generates a three-dimensional model is not limited to this. There are also various ways to represent a three-dimensional model; the present embodiment handles three-dimensional models represented as a set of voxels (points), but a three-dimensional model may instead be represented by polygons or the like. Details of the three-dimensional model are described later.
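The voxel-retention idea behind the visual volume intersection method can be sketched as follows. This is a toy illustration under stated assumptions: each camera is modeled as a simple projection function with a silhouette given as a set of foreground pixels, and the two example cameras are orthographic views along coordinate axes. A real system would instead use the calibrated perspective cameras 112; the function and variable names are hypothetical.

```python
from itertools import product


def carve_visual_hull(grid_size, cameras):
    """Keep the voxels whose projection lies inside every camera's silhouette.

    cameras is a list of (project, silhouette) pairs, where project maps a
    voxel (x, y, z) to a pixel (u, v) and silhouette is the set of pixels
    that belong to the subject region in that camera's image.
    """
    hull = set()
    for voxel in product(range(grid_size), repeat=3):
        # A voxel is retained only if no camera sees it outside the subject.
        if all(project(voxel) in silhouette for project, silhouette in cameras):
            hull.add(voxel)
    return hull


# Two toy orthographic cameras (assumed for brevity).
front_camera = (lambda v: (v[0], v[1]), {(1, 1), (1, 2)})  # looks along the z axis
side_camera = (lambda v: (v[2], v[1]), {(0, 1), (0, 2)})   # looks along the x axis

model = carve_visual_hull(3, [front_camera, side_camera])
```

The resulting set of voxels corresponds to the point-set representation of a three-dimensional model mentioned above.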

The posture estimation unit 203 uses the image data acquired by the data receiving unit 201 to generate, for each shooting time corresponding to a frame of the captured images, posture information representing the posture of an object in the shooting area. In the present embodiment, posture information is generated by posture estimation using deep learning, and the posture information is information representing a bone model that expresses the skeleton of the target object. However, the content of the posture information and the method of generating it are not limited to these. The details of the posture information will be described later. The posture interpolation unit 204 uses the posture information generated by the posture estimation unit 203 for each of a plurality of times to generate, by interpolation, posture information for a time between those times. The time to be interpolated is specified by the control unit 208.

The motion vector calculation unit 205 obtains a motion vector indicating the difference between the bone model represented by the posture information generated by the posture estimation unit 203 and the bone model represented by the posture information generated through interpolation by the posture interpolation unit 204. The model interpolation unit 206 uses the three-dimensional model generated by the model generation unit 202 and the motion vector obtained by the motion vector calculation unit 205 to generate a three-dimensional model for the time to be interpolated.

The rendering processing unit 207 generates a virtual viewpoint image based on the data of the three-dimensional model and the image data acquired by the data receiving unit 201. The control unit 208 controls, among other things, the order of the processes performed by the image generation device 122.

[3D model and posture information]
The three-dimensional model and the posture information will be described with reference to FIG. 3. FIG. 3 is a schematic diagram showing a person, as an example of an object in the shooting area, together with the person's three-dimensional model and bone model. The three-dimensional model indicates the position and shape of the object in three-dimensional space, and the bone model indicates the posture of the object in three-dimensional space; to simplify the explanation, however, FIG. 3 depicts both two-dimensionally. Based on the image data obtained when the plurality of cameras 112 shoot the object 301 in the shooting area, three-dimensional shape data representing the three-dimensional model 302 and posture information representing the bone model 303 are generated.

The three-dimensional model 302 in the present embodiment is represented by a point cloud, which is a set of voxels. The point cloud is described by the three-dimensional position (x, y, z) of each voxel in three-dimensional space and by information indicating the size of one voxel. A voxel is a cube, and its size is expressed, for example, by the length of one side. Because the three-dimensional shape of the object 301 is represented by a set of voxels, the finer the voxels, the higher the accuracy of the shape represented by the three-dimensional model 302. On the other hand, finer voxels mean more voxels in the three-dimensional model 302, and therefore a larger amount of information (a larger data size of the three-dimensional shape data).
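The trade-off between voxel size and data size can be made concrete with a rough sketch. The `VoxelModel` class, the helper `voxels_to_fill_box`, and the person-sized bounding box are hypothetical; a real model contains only the voxels occupied by the object, so the counts below only illustrate how quickly the number of voxels grows as the edge length shrinks.

```python
from dataclasses import dataclass

@dataclass
class VoxelModel:
    """Point cloud of cubic voxels: one position per voxel plus a shared edge length."""
    positions: list      # [(x, y, z), ...] in millimetres
    voxel_size_mm: int   # edge length of every voxel

def voxels_to_fill_box(box_mm, voxel_size_mm):
    """Number of voxels of the given edge length needed to fill a solid box."""
    count = 1
    for side in box_mm:
        count *= -(-side // voxel_size_mm)  # ceiling division
    return count

# A two-voxel model, just to show the representation.
model = VoxelModel(positions=[(0, 0, 0), (20, 0, 0)], voxel_size_mm=20)

box = (500, 300, 1800)  # hypothetical bounding box, in millimetres
coarse = voxels_to_fill_box(box, 20)
fine = voxels_to_fill_box(box, 10)
print(coarse, fine, fine // coarse)
```

Halving the edge length multiplies the number of voxels in a filled volume by eight, which is why finer voxels raise the data size so sharply.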

As shown in FIG. 3, the bone model 303 represented by the posture information consists of the structurally principal nodes of the object 301 and the lines connecting those nodes. Because the bone model 303 carries less information than the three-dimensional model 302, the posture information can express the rough movement and posture of the object 301 with a smaller data size than the three-dimensional shape data.

The temporal relationship among the captured images acquired by the cameras 112, the three-dimensional model generated by the model generation unit 202, and the posture information generated by the posture estimation unit 203 will be described with reference to FIG. 4. In the present embodiment, the shooting frame rate of the cameras 112 (the frame rate of the captured images) is assumed to be 60 fps. That is, each camera 112 acquires one frame of captured image every 1/60 second. The three-dimensional model and the posture information are each generated from the captured images at the same 60 fps frame rate. When a virtual viewpoint image is generated using a three-dimensional model at 60 fps, the frame rate of the virtual viewpoint image is also 60 fps.

On the other hand, there are cases in which a virtual viewpoint image with a frame rate higher than that of the captured images is required. The image generation system 100 therefore generates a 120 fps virtual viewpoint image by generating, through interpolation, a three-dimensional model for a time different from the times corresponding to the captured images. Specifically, the posture interpolation unit 204 generates, by interpolation from the posture information corresponding to each of two temporally consecutive frames, posture information for the time midway between the shooting times of those frames. The model interpolation unit 206 then generates, based on the posture information generated by the posture interpolation unit 204, a three-dimensional model corresponding to the same time as that interpolated posture information.

FIG. 9 shows the temporal positioning of the posture information generated by interpolation (hereinafter, interpolated posture information) and the three-dimensional model generated based on the interpolated posture information (hereinafter, interpolated three-dimensional model). Although one frame of captured image is acquired every 1/60 second, generating the interpolated posture information and the interpolated three-dimensional model yields posture information and a three-dimensional model every 1/120 second. Using these three-dimensional models makes it possible to generate a virtual viewpoint image at 120 fps, twice the frame rate of the captured images.
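The timing described above can be written out as a short sketch. The function `frame_timeline` and its parameters are illustrative names, not part of the embodiment; the defaults correspond to 60 fps capture with one interpolated frame between each pair of consecutive reference frames, and exact fractions are used so the 1/120-second spacing is visible.

```python
from fractions import Fraction

def frame_timeline(num_reference_frames, capture_fps=60, factor=2):
    """List (time, kind) pairs for reference and interpolated frames.

    factor - 1 interpolated frames are placed at equal spacing between
    each pair of consecutive reference frames.
    """
    step = Fraction(1, capture_fps * factor)
    timeline = []
    for i in range((num_reference_frames - 1) * factor + 1):
        kind = "reference" if i % factor == 0 else "interpolated"
        timeline.append((i * step, kind))
    return timeline

for time, kind in frame_timeline(3):
    print(f"{time} s  {kind}")
```

Three captured frames thus give five output frames, alternating between frames backed by captured images and frames backed only by interpolated data.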

[Operation flow]
FIG. 10 is a flowchart showing an example of the operation of the image generation device 122. The processing shown in FIG. 10 is realized by the CPU 211 of the image generation device 122 loading a program stored in the ROM 212 into the RAM 213 and executing it. At least part of the processing shown in FIG. 10 may be realized by one or more pieces of dedicated hardware different from the CPU 211. The processing shown in FIG. 10 starts when shooting by the plurality of cameras 112 has been performed and an instruction to generate a virtual viewpoint image is input to the image generation device 122. However, the start timing of the processing shown in FIG. 10 is not limited to this. The processing may be executed while the plurality of cameras 112 are shooting, or after shooting has been completed and the captured images have been recorded.

In S1001, the data receiving unit 201 acquires captured images based on shooting by the plurality of cameras 112. In S1002, the model generation unit 202 generates, based on the captured images, three-dimensional shape data representing a three-dimensional model for the same time as the captured images. This three-dimensional model is hereinafter referred to as the reference three-dimensional model. In S1003, the posture estimation unit 203 generates, based on the captured images, posture information for the same time as the captured images. This posture information is hereinafter referred to as the reference posture information.

In S1004, the posture interpolation unit 204, the motion vector calculation unit 205, and the model interpolation unit 206 generate three-dimensional shape data representing the interpolated three-dimensional model based on the reference three-dimensional model and the reference posture information. In S1005, the rendering processing unit 207 renders the virtual viewpoint image of a reference frame using the reference three-dimensional model. A reference frame of the virtual viewpoint image is a frame corresponding to the same time as a frame of the captured images. In S1006, the rendering processing unit 207 renders the virtual viewpoint image of an interpolation frame using the interpolated three-dimensional model. An interpolation frame of the virtual viewpoint image is a frame corresponding to a time different from those of the frames of the captured images, and is inserted between two consecutive reference frames.

The rendering processing in S1005 and S1006 produces a virtual viewpoint image with a frame rate higher than that of the captured images. In S1007, the rendering processing unit 207 outputs the generated virtual viewpoint image to the end user terminal 126, where it is displayed on the screen. Generating a virtual viewpoint image with a frame rate higher than that of the captured images in this way enables smooth video playback when, for example, the virtual viewpoint image is shown on a device capable of displaying images at a frame rate higher than that of the captured images. As another example, playing back the high-frame-rate virtual viewpoint image in slow motion allows slow-motion video to be reproduced smoothly.

Next, the details of the processing for generating the interpolated three-dimensional model in S1004 will be described with reference to FIG. 5. In S501, the control unit 208 acquires time information for the interpolation frame to be generated by interpolation. In the present embodiment, a 120 fps virtual viewpoint image is generated from 60 fps captured images, so the time information of an interpolation frame indicates a time midway between the times corresponding to the respective reference frames. The time information of the interpolation frame is acquired based on a user operation. For example, when the user performs an operation designating "120 fps" or "double speed", the time information of the interpolation frames needed to generate a 120 fps virtual viewpoint image is acquired. However, the method of acquiring the time information of the interpolation frame is not limited to this, and the control unit 208 may acquire time information determined based on the state of the object in the shooting area, the event being shot, or the like.

In S502, the posture interpolation unit 204 generates, by interpolation from the reference posture information corresponding to the reference frames before and after the interpolation frame, posture information for the time corresponding to the interpolation frame. The posture information interpolation method performed in S502 will be described with reference to FIG. 6. Here, an example is described in which posture information is generated for an interpolation frame corresponding to the time midway between two consecutive reference frames, frame N and frame N+1.

The bone model 600 is the bone model represented by the posture information of frame N, and represents the posture of the object at the time corresponding to frame N. The bone model 620 is the bone model represented by the posture information of frame N+1, and represents the posture of the object at the time corresponding to frame N+1. The bone model 610 is the bone model represented by the posture information of the interpolation frame, and represents the posture of the object at the time corresponding to the interpolation frame.

The posture interpolation unit 204 calculates, by linear interpolation from the position of the node 601 in the bone model 600 and the position of the corresponding node 602 in the bone model 620, the position of the corresponding node 603 in the interpolation frame. In the present embodiment, the time of the interpolation frame is the specific time midway between the two reference frames, so the position of the node 603 in the interpolation frame is calculated as the average of the coordinates of the node 601 and the coordinates of the node 602. The position of each node in the interpolation frame is calculated in this way, and connecting the calculated nodes generates the posture information representing the bone model 610 of the interpolation frame.
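The midpoint case described above can be sketched in a few lines. The node names, coordinates, and the dictionary layout are assumptions chosen for illustration; the embodiment does not prescribe how a bone model is stored.

```python
def interpolate_midpoint(bones_n, bones_n1):
    """Average the positions of corresponding nodes of two bone models."""
    return {
        name: tuple((a + b) / 2 for a, b in zip(bones_n[name], bones_n1[name]))
        for name in bones_n
    }

# Hypothetical node positions (metres) at frame N and frame N+1.
frame_n = {"shoulder": (0.0, 1.5, 0.0), "elbow": (0.25, 1.25, 0.0), "wrist": (0.5, 1.0, 0.0)}
frame_n1 = {"shoulder": (0.0, 1.5, 0.0), "elbow": (0.25, 1.5, 0.25), "wrist": (0.5, 1.5, 0.25)}
# Lines connecting the nodes; they are the same in every frame.
edges = [("shoulder", "elbow"), ("elbow", "wrist")]

interp = interpolate_midpoint(frame_n, frame_n1)
for a, b in edges:
    print(a, interp[a], "->", b, interp[b])
```

Only the node positions are interpolated; the connectivity between nodes is carried over unchanged, so the interpolated bone model has the same structure as the two reference bone models.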

In S503, the motion vector calculation unit 205 and the model interpolation unit 206 generate the interpolated three-dimensional model using the interpolated posture information generated in S502. The details of the processing in S503 will be described with reference to FIG. 7. In S701, the motion vector calculation unit 205 calculates motion vectors between the bone model represented by the reference posture information and the bone model represented by the interpolated posture information. To improve interpolation accuracy, the reference posture information used here is preferably that of a time close to the time of the interpolation frame. For example, when the interpolation frame is at the specific time midway between two reference frames, the reference posture information of either the reference frame before or the reference frame after the interpolation frame is used.

In S702, the model interpolation unit 206 divides the bone model of the interpolation frame into regions according to the magnitude of the motion vectors. FIG. 8A shows the bone model 610 in the interpolation frame shown in FIG. 6. FIG. 8B shows an enlarged view of the portion 800 of the bone model 610. As shown in FIG. 8B, the movement of the region 811 between the bone model 600 in the reference frame and the bone model 610 in the interpolation frame is represented by the motion vector 801. Similarly, the movement of the region 812 is represented by the motion vector 802, and the movement of the region 813 is represented by the motion vector 803. A motion vector indicates the direction and amount of movement per unit time and is expressed, for example, by components (vx, vy, vz). In the present embodiment, the bone model is divided into a plurality of regions according to the magnitude of the motion vectors, but the division is not limited to this; the bone model may be divided into a plurality of regions according to some other criterion, and a motion vector may then be calculated for each region.
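One way to picture S701 and S702 is the sketch below. It is a simplification under stated assumptions: motion vectors are computed per node rather than per bone segment, nodes are grouped by binning the magnitude of their motion vectors with an arbitrary `bin_width`, and the node names and coordinates are invented. The embodiment does not specify the exact division rule.

```python
import math
from fractions import Fraction

def motion_vectors(ref_bones, interp_bones, dt):
    """Per-node motion vector (vx, vy, vz) per unit time between two bone models."""
    return {
        name: tuple((b - a) / dt for a, b in zip(ref_bones[name], interp_bones[name]))
        for name in ref_bones
    }

def split_by_magnitude(vectors, bin_width):
    """Group nodes whose motion-vector magnitudes fall into the same bin."""
    regions = {}
    for name, v in vectors.items():
        speed = math.sqrt(sum(float(c) ** 2 for c in v))
        regions.setdefault(int(speed // bin_width), []).append(name)
    return regions

# Hypothetical node positions in millimetres.
ref = {"shoulder": (0, 1400, 0), "elbow": (300, 1200, 0), "wrist": (500, 1000, 0)}
interp = {"shoulder": (0, 1400, 0), "elbow": (300, 1205, 0), "wrist": (500, 1020, 0)}

dt = Fraction(1, 120)  # time from the reference frame to the interpolation frame
vectors = motion_vectors(ref, interp, dt)          # millimetres per second
regions = split_by_magnitude(vectors, bin_width=1000)
print(regions)
```

Parts of the body that move at similar speeds end up in the same region, so a nearly stationary torso and a fast-moving forearm receive different motion vectors.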

In S703, the model interpolation unit 206 generates the interpolated three-dimensional model by changing the position of each voxel constituting the reference three-dimensional model according to the motion vector corresponding to the region to which that voxel belongs. For example, as shown in FIG. 8C, moving the voxel 821, which is part of the reference three-dimensional model of frame N, according to the motion vector 803 corresponding to the region 813 to which the voxel 821 belongs yields the voxel 822, which is part of the interpolated three-dimensional model. If the coordinates of the voxel 821 in the reference three-dimensional model are (x, y, z), the coordinates (x', y', z') of the corresponding voxel 822 in the interpolated three-dimensional model are obtained by the following equations.
x' = x + vx × t
y' = y + vy × t
z' = z + vz × t
Here, t is the time from the time of the reference frame to the time of the interpolation frame, which is 1/120 second in the present embodiment. Calculating the position of each voxel in this way generates the interpolated three-dimensional model.
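The three equations translate directly into code. In this sketch the assignment of each voxel to a region (`region_of`) is taken as given, since the text does not describe how it is determined, and the region names, coordinates, and motion vectors are hypothetical.

```python
from fractions import Fraction

def interpolate_voxels(ref_voxels, region_of, region_vectors, t):
    """Move each reference voxel by its region's motion vector multiplied by t."""
    moved = []
    for index, (x, y, z) in enumerate(ref_voxels):
        vx, vy, vz = region_vectors[region_of[index]]
        moved.append((x + vx * t, y + vy * t, z + vz * t))
    return moved

# Hypothetical voxel positions (millimetres) in the reference model of frame N.
ref_voxels = [(0, 1400, 0), (500, 1000, 0)]
region_of = ["torso", "forearm"]             # assumed region label per voxel
region_vectors = {"torso": (0, 0, 0),        # millimetres per second
                  "forearm": (0, 2400, 0)}

t = Fraction(1, 120)  # reference frame to interpolation frame, in seconds
moved = interpolate_voxels(ref_voxels, region_of, region_vectors, t)
print(moved)
```

Because every voxel in a region shares one motion vector, each region is translated rigidly; the number of voxels and the voxel size are unchanged from the reference model.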

[Modification example]
The embodiment described above covers the case of generating a virtual viewpoint image with a frame rate twice that of the captured images. However, the frame rate of the virtual viewpoint image generated by the image generation system 100 is not limited to this, and the image generation system 100 can generate a virtual viewpoint image at an arbitrary frame rate by a method similar to the one described above. A specific example of generating a virtual viewpoint image with a frame rate three times that of the captured images is given below.

FIG. 11 shows the temporal relationship among the captured images, the reference three-dimensional models, the reference posture information, the interpolated three-dimensional models, and the interpolated posture information. Frame N, frame N+1, and frame N+2 of the captured images are consecutive frames, and the interval between them is 1/60 second. To generate a virtual viewpoint image with a frame rate three times that of the captured images, two interpolation frames are inserted between each pair of consecutive reference frames, and interpolated posture information and an interpolated three-dimensional model are generated for each interpolation frame. In this modification, the frames, including the interpolation frames, are spaced at equal time intervals, so the interval between frames is 1/180 second.

FIG. 12 shows the bone model 600 represented by the posture information of frame N, the bone model 620 represented by the posture information of frame N+1, and the bone model 1210 represented by the posture information of an interpolation frame. This interpolation frame corresponds to the time 1/180 second after the time corresponding to frame N. The bone model 1210 is generated by interpolation processing using the bone model 600 and the bone model 620. Specifically, the posture interpolation unit 204 calculates, by linear interpolation from the position of the node 601 in the bone model 600 and the position of the corresponding node 602 in the bone model 620, the position of the corresponding node 1203 in the interpolation frame. The coordinates (x, y, z) of the node 1203 are obtained by the following equations.
x = x1 + (x2 - x1) × t1 / T
y = y1 + (y2 - y1) × t1 / T
z = z1 + (z2 - z1) × t1 / T
Here, (x1, y1, z1) are the coordinates of the node 601 in frame N, and (x2, y2, z2) are the coordinates of the node 602 in frame N+1. T is the time interval between frame N and frame N+1, and t1 is the time interval between frame N and the interpolation frame that immediately follows frame N.

The position of each node in the interpolation frame is calculated in this way, and connecting the calculated nodes generates the posture information representing the bone model 1210 of the interpolation frame. The bone model for the time corresponding to the other interpolation frame inserted between frame N and frame N+1 (2/180 second after the time of frame N) is generated by the same method. Then, based on the interpolated posture information representing the bone models of the generated interpolation frames, interpolated three-dimensional models are generated in the same manner as in the embodiment described above. This makes it possible to generate a virtual viewpoint image at 180 fps.
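The interpolation of a node for the two inserted frames can be checked with a small worked example. The function name `interpolate_node` and the node coordinates are assumptions; exact fractions are used so that the weights t1 / T = 1/3 and 2/3 are applied without rounding.

```python
from fractions import Fraction

def interpolate_node(p1, p2, t1, T):
    """Linear interpolation per coordinate: p = p1 + (p2 - p1) * t1 / T."""
    return tuple(a + (b - a) * t1 / T for a, b in zip(p1, p2))

T = Fraction(1, 60)          # interval between frame N and frame N+1
node_601 = (0, 900, 0)       # hypothetical coordinates in millimetres (frame N)
node_602 = (30, 960, -15)    # corresponding node in frame N+1

first = interpolate_node(node_601, node_602, Fraction(1, 180), T)
second = interpolate_node(node_601, node_602, Fraction(2, 180), T)
print(first, second)
```

Setting t1 to T / 2 in the same function gives the average of the two positions, which is the midpoint case of the 120 fps embodiment.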

The present invention can also be realized by processing in which a program that implements one or more functions of the above-described embodiment is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions. The program may also be provided by being recorded on a computer-readable recording medium.

100 Image generation system
112 Camera
122 Image generation device

Claims

An information processing apparatus comprising:
first generation means for generating shape data representing a three-dimensional shape of an object at a predetermined shooting time, based on a plurality of images obtained by shooting the object from different directions with a plurality of shooting devices at the predetermined shooting time;
acquisition means for acquiring first posture information representing a posture of the object at the predetermined shooting time and second posture information representing a posture of the object at a specific time different from the predetermined shooting time; and
second generation means for generating shape data representing the three-dimensional shape of the object at the specific time, based on the first posture information and the second posture information acquired by the acquisition means and the shape data generated by the first generation means.

The information processing apparatus according to claim 1, wherein
the first generation means generates, based on a plurality of moving images obtained by shooting the object from different directions with the plurality of shooting devices in a predetermined shooting period, shape data representing the three-dimensional shape of the object at each of a plurality of times corresponding to a plurality of frames included in the plurality of moving images,
the predetermined shooting time is a time corresponding to a frame included in the plurality of frames, and
the specific time is a time that is included in the predetermined shooting period and is not included in the plurality of times corresponding to the plurality of frames.

The information processing apparatus according to claim 2, further comprising image generation means for generating, by rendering processing using the shape data generated by the first generation means and the shape data generated by the second generation means, a virtual viewpoint image that corresponds to a position and a line-of-sight direction of a virtual viewpoint and that has a frame rate higher than that of the moving images.

The information processing apparatus according to any one of claims 1 to 3, wherein the acquisition means acquires the first posture information based on a plurality of images obtained by shooting with the plurality of shooting devices at the predetermined shooting time.

The information processing apparatus according to claim 4, wherein the acquisition means
acquires third posture information representing the posture of the object at another shooting time different from the predetermined shooting time, based on a plurality of images obtained by shooting with the plurality of shooting devices at the other shooting time, and
acquires the second posture information by interpolation processing using the first posture information and the third posture information.

The information processing apparatus according to any one of claims 1 to 5, wherein the first posture information and the second posture information are information expressing a model of the skeleton of the object.

The information processing apparatus according to any one of claims 1 to 6, wherein the second generation means generates the shape data representing the three-dimensional shape of the object at the specific time by modifying the three-dimensional shape represented by the shape data generated by the first generation means, based on the difference between the posture represented by the first posture information and the posture represented by the second posture information.

The information processing apparatus according to any one of claims 1 to 7, wherein the shape data is data expressing the three-dimensional shape of the object by voxels.

The information processing apparatus according to any one of claims 1 to 7, wherein the shape data is data representing a three-dimensional shape of the object by polygons.

The information processing apparatus according to any one of claims 1 to 9, wherein the first generation means generates the shape data by using the visual volume intersection method.

An information processing method comprising:
a first generation step of generating shape data representing a three-dimensional shape of an object at a predetermined shooting time, based on a plurality of images obtained by shooting the object from different directions with a plurality of shooting devices at the predetermined shooting time;
an acquisition step of acquiring first posture information representing a posture of the object at the predetermined shooting time and second posture information representing a posture of the object at a specific time different from the predetermined shooting time; and
a second generation step of generating shape data representing the three-dimensional shape of the object at the specific time, based on the first posture information and the second posture information acquired in the acquisition step and the shape data generated in the first generation step.

The information processing method according to claim 11, wherein
in the first generation step, shape data representing the three-dimensional shape of the object at each of a plurality of times corresponding to a plurality of frames included in a plurality of moving images is generated, based on the plurality of moving images obtained by shooting the object from different directions with the plurality of shooting devices in a predetermined shooting period,
the predetermined shooting time is a time corresponding to a frame included in the plurality of frames, and
the specific time is a time that is included in the predetermined shooting period and is not included in the plurality of times corresponding to the plurality of frames.

The information processing method according to claim 12, further comprising an image generation step of generating, by rendering processing using the shape data generated in the first generation step and the shape data generated in the second generation step, a virtual viewpoint image that corresponds to a position and a line-of-sight direction of a virtual viewpoint and that has a frame rate higher than that of the moving images.

A program for causing a computer to function as each means of the information processing apparatus according to any one of claims 1 to 10.