JP7767111B2 - Image processing device, image processing method and program

Info

Publication number: JP7767111B2
Application number: JP2021177459A
Authority: JP
Inventors: 正明松岡
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2025-11-11
Anticipated expiration: 2041-10-29
Also published as: JP2023066705A

Description

The present invention relates to an image processing device that performs three-dimensional modeling using machine learning.

Techniques for three-dimensional modeling using images of an object captured from various angles are known. Patent Document 1 discloses a technique that generates, with a small amount of computation, an image of an object as viewed from an angle different from the one at which it was photographed. However, because the color of light reflected from an object inherently changes with the viewing angle, the technique disclosed in Patent Document 1 can produce an unnatural appearance when the angle at which the image is reconstructed is changed.

Non-Patent Document 1 discloses a technique that reconstructs natural, photorealistic images by taking into account the direction of rays in addition to three-dimensional positions in space, sampling points on the rays, and performing volume rendering.

Patent Document 1: Japanese Patent Application Laid-Open No. 2018-205863
Patent Document 2: Japanese Patent Application Laid-Open No. 2008-15754

Non-Patent Document 1: Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng, "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis", In ECCV, 2020.
Non-Patent Document 2: Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, and Zhaoyang Lv, "Neural 3D Video Synthesis", arXiv:2103.02597, 2021.

However, the technique disclosed in Non-Patent Document 1 must sample points on each ray from one end of the target space to the other for volume rendering, which increases the amount of computation and requires a long processing time.

The present invention therefore aims to provide an image processing device, an image processing method, and a program capable of performing volume rendering at high speed.

An image processing device according to one aspect of the present invention comprises: an acquisition means for acquiring a teacher image and a camera position corresponding to the teacher image; a ray calculation means for calculating a ray corresponding to each pixel of the teacher image using the camera position; and a learning parameter calculation means for sampling points on the ray and performing machine learning using the teacher image to calculate learning parameters, wherein the sampling density of the ray within the depth of field range of the teacher image is higher than the sampling density outside the depth of field range.

Other objects and features of the present invention are described in the following embodiments.

According to the present invention, it is possible to provide an image processing device, an image processing method, and a program capable of performing volume rendering at high speed.

FIG. 1 is a block diagram of a personal computer according to the first embodiment.
FIG. 2 is a flowchart of 3D model learning in the first embodiment.
FIG. 3 is an explanatory diagram of capturing a teacher image in the first embodiment.
FIG. 4 is an explanatory diagram of a teacher image and a focus map in the first embodiment.
FIG. 5 is a flowchart of free viewpoint image rendering in the first embodiment.
FIG. 6 is an explanatory diagram of a free viewpoint camera in the first embodiment.
FIG. 7 is an explanatory diagram of a teacher image and a focus map in the first embodiment.
FIG. 8 is an explanatory diagram of subject depth calculation in the first embodiment.
FIG. 9 is an explanatory diagram of calculating the coordinates of a three-dimensional point in the first embodiment.
FIG. 10 is an explanatory diagram of a teacher image and a low-resolution focus map in the first embodiment.
FIG. 11 is an explanatory diagram of capturing a teacher image and of a free viewpoint camera in the second embodiment.
FIG. 12 is an explanatory diagram of a peripheral teacher image and a focus map in the second embodiment.
FIG. 13 is an explanatory diagram of a teacher image and a focus map in the second embodiment.
FIG. 14 is a block diagram of a personal computer according to the third embodiment.
FIG. 15 is a flowchart of dynamic 3D model learning in the third embodiment.
FIG. 16 is an explanatory diagram of capturing a teacher image in the third embodiment.
FIG. 17 is an explanatory diagram of capturing a teacher image in the third embodiment.
FIG. 18 is an explanatory diagram of a teacher image and a focus map in the third embodiment.
FIG. 19 is an explanatory diagram of a teacher image and a focus map in the third embodiment.
FIG. 20 is a flowchart of free viewpoint video rendering in the third embodiment.

Embodiments of the present invention will be described in detail below with reference to the drawings.

(First embodiment)
First, a personal computer (image processing device) according to the first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram of a personal computer (image processing device) 100. Although this embodiment uses a personal computer as an example of an image processing device, the present invention is not limited to this and is also applicable to image processing devices other than personal computers.

The control unit 101 is, for example, a CPU. It reads the operation program for each block of the personal computer 100 from the ROM 102, loads it into the RAM 103, and executes it to control the operation of each block of the personal computer 100. The ROM 102 is a rewritable non-volatile memory such as an SSD, and stores the operation programs for each block of the personal computer 100 as well as parameters required for the operation of each block. The RAM 103 is a rewritable volatile memory such as a DRAM, and is used as a temporary storage area for data output during the operation of each block of the personal computer 100. The data storage unit 104 is a recording medium, such as a hard disk, to and from which the image data required for machine learning, the metadata of each image, and the like are written and read.

The shooting camera position and orientation estimation unit 105 estimates the position and orientation of the shooting camera for each image in the image data group using a well-known technique such as SfM (Structure from Motion). In other words, the shooting camera position and orientation estimation unit 105 is an acquisition means for acquiring teacher images and the positions of the cameras corresponding to the teacher images.

The ray calculation unit 106 calculates the rays used in volume rendering using, for example, the method disclosed in Non-Patent Document 1. In Non-Patent Document 1, a ray used in volume rendering is defined as r(t) = o + td. Here, o is the camera's principal point in the world coordinate system, d is the ray's direction vector expressed in the world coordinate system, and t is the distance from the camera's principal point to the sampling point on the ray. The ray direction vector d is obtained by calculating the three-dimensional vector pointing from the camera's principal point to each pixel on the image plane. The camera's principal point o and the ray direction vector d are transformed from the camera coordinate system to the world coordinate system using the camera position and orientation parameters. In other words, the ray calculation unit 106 is a ray calculation means that, during learning, calculates the ray corresponding to each pixel of the teacher image using the camera position and, during inference, calculates the ray corresponding to each pixel of the arbitrary viewpoint camera using the camera position and orientation.
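The ray definition above can be sketched in a few lines. This is a minimal illustration rather than the patented implementation: the function names, the pinhole model with the optical axis along +z, the principal point at the image centre, and the 4x4 `cam_to_world` pose matrix are all assumptions introduced here.

```python
import numpy as np

def pixel_rays(width, height, focal_px, cam_to_world):
    """Return ray origins o and unit directions d in world coordinates.

    Assumptions (not from the source): pinhole camera looking along +z,
    focal length given in pixels, cam_to_world is a 4x4 pose matrix.
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    # Vector from the camera principal point to each pixel, camera frame.
    dirs_cam = np.stack(
        [(u - width / 2) / focal_px, (v - height / 2) / focal_px, np.ones_like(u)],
        axis=-1,
    )
    rot, trans = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rot.T  # rotate directions into the world frame
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(trans, dirs_world.shape)
    return origins, dirs_world

def point_on_ray(o, d, t):
    """Evaluate r(t) = o + t d; with unit d, t is the distance from o."""
    return o + t * d
```

Normalizing d makes t a true distance from the camera, which matches the way t is described in the text.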

The neural network unit 107 samples points on each ray, computes the color and density of each sampled point using a neural network, and performs volume rendering across the target space to determine the color of the pixel corresponding to each ray. For learning and inference, a technique such as that disclosed in Non-Patent Document 1 may be used. During learning, the learning weights are made to converge by backpropagation, using the L2 loss between the color calculated by volume rendering and the color of the captured image as the loss function. That is, during learning, the neural network unit 107 is a learning parameter calculation means that samples points on the rays and performs machine learning using the teacher images to calculate learning parameters. During inference, a free viewpoint image is rendered by volume-rendering each ray at the free camera position and orientation. That is, during inference, the neural network unit 107 is a rendering means that renders the image of the camera by machine learning, using the learning parameters trained in advance by the learning parameter calculation means through sampling points on the rays.

F_Θ: (x, d) → (c, σ) ... (1)
In equation (1), F_Θ is a neural network consisting of a multilayer perceptron. It takes as input the three-dimensional coordinates x of a sampled point on the ray and the ray direction vector d, and outputs an RGB color c and a density σ for each sampling point. Volume rendering for a ray r is expressed as C_vr(r) = Σ_{i=1..N} T_i (1 - exp(-σ_i·δ_i)) c_i. Here, N is the number of samples, i is the index of each sampling point, c_i and σ_i are the color and density corresponding to index i, δ_i = t(i+1) - t(i), and t(i) is the distance from the camera principal point to the sampling point with index i on the ray. Also, T_i = exp(-Σ_{j=1..i-1} σ_j·δ_j), which removes the color contribution of objects hidden by occlusion.
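The volume rendering sum and the L2 loss used during learning can be written out as a short NumPy sketch. The function names are illustrative, `t_vals` is assumed to hold N + 1 distances so that every δ_i is defined, and the L2 loss is taken here as a mean of squared differences, which is one common convention and not something the source specifies.

```python
import numpy as np

def volume_render(colors, sigmas, t_vals):
    """C_vr(r) = sum_i T_i (1 - exp(-sigma_i * delta_i)) c_i.

    colors: (N, 3), sigmas: (N,), t_vals: (N + 1,) sample distances
    (assumed layout) so that delta_i = t(i+1) - t(i) exists for all i.
    """
    deltas = np.diff(t_vals)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): exclusive cumulative sum.
    accumulated = np.concatenate([[0.0], np.cumsum(sigmas * deltas)[:-1]])
    transmittance = np.exp(-accumulated)
    weights = transmittance * alphas
    return (weights[:, None] * colors).sum(axis=0)

def l2_loss(rendered, target):
    """Mean squared difference between rendered and captured colors."""
    diff = np.asarray(rendered, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))
```

Because T_i decays with the density accumulated in front of sample i, a dense sample near the camera drives the weights of everything behind it toward zero, which is how occluded colors drop out of the sum.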

The free camera position and orientation acquisition unit 108 is an acquisition means for acquiring the camera position and orientation (the position of the arbitrary viewpoint camera) for the free viewpoint image. It may acquire camera position and orientation parameters calculated in advance by an external device, or it may acquire the camera position and orientation specified by the user via an operating member such as a joystick.

Next, 3D model learning by the control unit 101 will be described with reference to FIG. 2. FIG. 2 is a flowchart of 3D model learning. First, in step S201, the shooting camera position and orientation estimation unit 105 estimates the camera position and orientation corresponding to each image used for learning. Learning updates the learning weights through iteration processing so that they converge to the target learning weights. Next, in step S202, the control unit 101 determines whether the iteration processing is complete. If the iteration processing is complete, this flow ends. If it is not complete, the process proceeds to step S203.

In step S203, in each iteration, the control unit 101 first randomly selects as many rays as the batch size. The batch size may be set to, for example, 4096 rays, as shown in Non-Patent Document 1. In this embodiment, the computational cost can be reduced efficiently at this point by excluding rays that are not suitable for 3D modeling when selecting rays. This operation will be explained with reference to FIGS. 3 and 4.

FIG. 3 is an explanatory diagram of capturing a teacher image, showing a bird's-eye view that illustrates the relationship between the subjects, the shooting space, and the shooting camera. 301 is the shooting camera (imaging device), 302 is the range of the shooting space to be 3D modeled, 303 is the main subject, and 304 is the background subject. The shooting camera 301 is focused on the main subject 303, and the region that is in focus, bounded by the angle of view and the depth of field, is indicated by the hatched area 305. Multiple shooting cameras are arranged to capture the subjects within the shooting space range 302 from various directions, but for simplicity only the shooting camera 301 is shown in FIG. 3.

FIG. 4 is an explanatory diagram of a teacher image and a focus map (distance distribution information), showing an image 401 captured by the shooting camera 301 and a focus map 402, which is its associated metadata. In FIG. 4, the degree of focus on the imaging plane is shown as a grayscale map, in which white indicates the near side, black indicates the far side, and 50% gray indicates in-focus. As disclosed in Patent Document 2, for example, the focus map 402 may be obtained as a defocus map on the imaging plane from a phase-difference image produced by an imaging sensor in which all pixels are phase-difference pixels.

Rays 306 and 307 in FIG. 3 are candidate rays for volume rendering, but ray 306 is excluded because no subject within the depth of field lies on it. Whether a ray is unnecessary can be determined from the focus map 402 in FIG. 4. Pixel 404, which corresponds to ray 307, is an in-focus pixel of 50% gray, whereas pixel 403, which corresponds to ray 306, is a dark gray outside the depth of field, so ray 306 is determined to be a ray to be excluded. Excluding unnecessary rays in this way speeds up the volume rendering process. In this embodiment, the acquisition means acquires distance distribution information (a focus map) corresponding to the teacher image, and the ray calculation means determines, based on the distance distribution information, whether each ray is within the depth of field of the teacher image.

Although rays outside the depth of field are excluded in this embodiment, the present invention is not limited to this. The ray calculation unit 106 may select rays efficiently by varying the density of selected rays between the inside and outside of the depth of field (that is, the ray calculation means may preferentially select rays that are within the depth of field of the teacher image). For example, all rays within the depth of field may be selected, while the rays selected from outside the depth of field are limited to 10% or less of all rays of the target camera.
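The ray selection described above can be sketched as follows. The encoding is an assumption made for illustration: the focus map is treated as an array of signed defocus values with 0 meaning in focus, and `tolerance`, `outside_fraction`, and the function name are hypothetical. Setting `outside_fraction` to 0 reproduces plain exclusion, and a value of 0.1 or less corresponds to the 10% example.

```python
import numpy as np

def select_rays(defocus_map, tolerance, batch_size, outside_fraction=0.0, rng=None):
    """Pick (rows, cols) of pixels whose rays go into one training batch.

    defocus_map: (H, W) signed defocus values, 0 = in focus (assumed encoding).
    tolerance: |defocus| up to this value counts as inside the depth of field.
    outside_fraction: out-of-focus rays kept as candidates, expressed as a
    fraction of all rays of the camera.
    """
    rng = np.random.default_rng() if rng is None else rng
    magnitude = np.abs(np.asarray(defocus_map, dtype=float)).ravel()
    inside = np.flatnonzero(magnitude <= tolerance)
    outside = np.flatnonzero(magnitude > tolerance)
    keep = min(outside.size, int(outside_fraction * magnitude.size))
    candidates = np.concatenate([inside, rng.permutation(outside)[:keep]])
    picked = rng.permutation(candidates)[:batch_size]  # random batch
    return np.unravel_index(picked, np.shape(defocus_map))
```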

After as many rays as the batch size have been selected in step S203 of FIG. 2, the ray calculation unit 106 calculates the rays in step S204. As described above, a ray for volume rendering is defined as r(t) = o + td. Here, o is the principal point of the camera in the world coordinate system, d is the direction vector of the ray expressed in the world coordinate system, and t is the distance from the camera principal point to the sampling point on the ray.

The range of the distance t sampled for volume rendering is limited to the depth of field indicated by the hatched area 305 in FIG. 3. The depth range of the hatched area 305 is expressed as Z-Df to Z+Db, using the front depth of field Df, the rear depth of field Db, and the focused subject distance Z. Here, Df = (r·Av·Z^2)/(f^2+r·Av·Z) and Db = (r·Av·Z^2)/(f^2-r·Av·Z), where r is the permissible circle of confusion diameter, Av is the aperture value, and f is the focal length. The permissible circle of confusion diameter r is set to twice the pixel pitch. Limiting the range of the distance t sampled for volume rendering to within the depth of field in this way speeds up the volume rendering process.
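As a worked illustration of these formulas, the sketch below computes the sampling range Z-Df to Z+Db. The function and parameter names are hypothetical, all lengths are assumed to share one unit, and the guard for a non-positive denominator in Db (focus at or beyond the hyperfocal distance, where the rear depth of field is unbounded) is an addition not discussed in the source.

```python
def depth_of_field_range(focus_distance, focal_length, f_number,
                         pixel_pitch, coc_pixels=2.0):
    """Return (near, far) = (Z - Df, Z + Db), all lengths in one unit."""
    r = coc_pixels * pixel_pitch          # permissible circle of confusion
    k = r * f_number * focus_distance     # r * Av * Z
    d_front = k * focus_distance / (focal_length ** 2 + k)
    denom = focal_length ** 2 - k
    # Added guard: beyond the hyperfocal distance Db is unbounded.
    d_back = k * focus_distance / denom if denom > 0 else float("inf")
    return focus_distance - d_front, focus_distance + d_back
```

For example, with f = 50 mm, Av = 2.8, Z = 3000 mm, and a 0.005 mm pixel pitch, r·Av·Z = 84 mm^2, so Df = 252000/2584 (about 97.5 mm) and Db = 252000/2416 (about 104.3 mm).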

Although the permissible circle of confusion diameter r is set to twice the pixel pitch in this embodiment, the present invention is not limited to this; the diameter may be made coarser or finer depending on, for example, the resolution at which the free viewpoint image is rendered. In other words, the permissible circle of confusion diameter used to determine the depth of field may be determined based on the resolution used when rendering with the learning parameters (the rendering resolution).

Although the depth of field is defined as Z-Df to Z+Db in this embodiment, the present invention is not limited to this. Taking into account errors in the focal length f and the aperture value Av obtainable from the camera, a range with some margin, such as Z-2·Df to Z+2·Db, may be used. Also, although the ray range for volume rendering is limited to within the depth of field in this embodiment, the present invention is not limited to this. Sampling may instead be made efficient by varying the sampling density between the inside and outside of the depth of field. That is, the sampling density of the ray within the depth of field of the teacher image may be made higher than the sampling density outside the depth of field. For example, for ray 307 in FIG. 3, 32 points are first sampled from the entire range covered by the 3D modeling target range 302, and an additional 128 points are then sampled from the range covered by the hatched area 305. In this way, sampling points may be placed densely only within the depth of field.
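The 32 + 128 point example can be sketched as below. Uniform spacing is an assumption made for brevity (the source does not say how the points are spaced within each range), and the function name and arguments are hypothetical.

```python
import numpy as np

def sample_distances(scene_near, scene_far, dof_near, dof_far,
                     n_coarse=32, n_dense=128):
    """Sorted distances t along one ray: coarse everywhere, dense inside the DoF."""
    coarse = np.linspace(scene_near, scene_far, n_coarse)
    # Clip the depth-of-field interval to the modeled range.
    lo, hi = max(scene_near, dof_near), min(scene_far, dof_far)
    dense = np.linspace(lo, hi, n_dense) if hi > lo else np.empty(0)
    return np.sort(np.concatenate([coarse, dense]))
```

Passing `dof_near = Z - 2*Df` and `dof_far = Z + 2*Db` gives the wider margin mentioned in the text without changing the function.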

After the rays are calculated in step S204 of FIG. 2, the neural network unit 107 updates the learning weights in step S205. The control unit 101 determines the learning weights by repeating steps S202 to S205 until the learning weights converge. Completion of the iterations in step S202 may be determined, for example, by whether the number of iterations has reached 100K to 300K, as disclosed in Non-Patent Document 1.

Next, free viewpoint image rendering by the control unit 101 will be described with reference to FIG. 5. FIG. 5 is a flowchart of free viewpoint image rendering. First, in step S501, the free camera position and orientation acquisition unit 108 acquires the camera position and orientation of the free viewpoint to be rendered. Next, in step S502, the control unit 101 determines whether the calculation of pixel values (RGB values) by volume rendering has been completed for all pixels of the rendered image. If the calculation has been completed for all pixels, this flow ends. If it has not been completed for all pixels, the process proceeds to step S503.

In step S503, the control unit 101 determines, for each pixel, whether the corresponding three-dimensional point is within the depth of field of the learning images, that is, whether the ray is a ray within the depth of field. If the three-dimensional point is not within the depth of field, the process returns to step S502. If the three-dimensional point is within the depth of field, the process proceeds to step S504 so that volume rendering is performed. Pixels corresponding to rays outside the depth of field are assigned a fixed pixel value, such as black.
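The per-pixel decision in this flow can be sketched as a simple loop. Everything here is illustrative: `in_dof_mask` stands in for the depth-of-field test, and `render_ray` is a hypothetical callable that performs volume rendering for one pixel and returns an RGB triple.

```python
import numpy as np

def render_image(in_dof_mask, render_ray, background=(0.0, 0.0, 0.0)):
    """Volume-render only the pixels flagged as inside the depth of field.

    in_dof_mask: (H, W) bool array, True where the ray passes the test.
    render_ray: hypothetical callable (row, col) -> RGB triple.
    Pixels that fail the test keep the fixed background value (black here).
    """
    h, w = in_dof_mask.shape
    image = np.empty((h, w, 3))
    image[:] = background
    for row, col in zip(*np.nonzero(in_dof_mask)):
        image[row, col] = render_ray(int(row), int(col))
    return image
```

The saving comes from never calling `render_ray` for masked-out pixels, so the cost scales with the number of in-focus rays rather than with the image size.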

FIG. 6 is an explanatory diagram of a free viewpoint camera, showing a bird's-eye view that illustrates the relationship between the subjects, the shooting space, and each camera. FIG. 7 is an explanatory diagram of a teacher image and a focus map, showing an image 701 captured by the shooting camera 601 and a focus map 702, which is its associated metadata. In FIG. 6, 603 is the free viewpoint camera to be rendered, and 301 and 601 are the shooting cameras adjacent to the free viewpoint camera 603.

Pixel 404 in FIG. 4 and pixel 704 in FIG. 7 represent the same three-dimensional point, 607 in FIG. 6. Likewise, ray 605 of the free viewpoint camera 603 and ray 307 of the shooting camera 301 represent the same three-dimensional point 607 in FIG. 6. If the three-dimensional coordinates of point 607 are known in advance, it can be confirmed that ray 605 corresponds to ray 307; these coordinates can be obtained from the focus map 402 in FIG. 4.

FIG. 8 is an explanatory diagram of subject depth calculation, illustrating the procedure for calculating the subject depth Z+ΔZ from the defocus value def that the focus map gives for each pixel position. 801 is the imaging optical system, 802 is the imaging plane position, 803 is the in-focus subject distance position, 804 is the defocused image position, and 805 is the subject distance position. From the lens formula, 1/Z + 1/Z' = 1/f and 1/(Z+ΔZ) + 1/(Z'+def) = 1/f hold, where f is the focal length, so the subject depth Z+ΔZ can be calculated from these two equations.
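Solving the two lens equations gives Z' = f·Z/(Z - f) and then Z+ΔZ = f·(Z'+def)/(Z'+def - f). A minimal sketch of that calculation follows; the function name is hypothetical, all lengths are assumed to share one unit, and def is taken with the sign convention implied by the term Z'+def.

```python
def subject_depth(focus_distance, focal_length, defocus):
    """Return Z + dZ from the thin-lens equations, given Z, f and def."""
    # 1/Z + 1/Z' = 1/f  ->  Z' = f * Z / (Z - f)
    image_distance = focal_length * focus_distance / (focus_distance - focal_length)
    shifted = image_distance + defocus  # Z' + def
    # 1/(Z + dZ) + 1/(Z' + def) = 1/f  ->  Z + dZ = f * (Z' + def) / (Z' + def - f)
    return focal_length * shifted / (shifted - focal_length)
```

With def = 0 the function returns the focus distance itself, which is a quick consistency check on the algebra.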

Once the subject depth Z+ΔZ is known, x/f = X/(Z+ΔZ) and y/f = Y/(Z+ΔZ) hold from the similar-triangle relationship shown in FIG. 9, which is an explanatory diagram of calculating the coordinates of a three-dimensional point. From these, the X coordinate, the Y coordinate, and the Z coordinate (Z+ΔZ) of the three-dimensional point 607 can be obtained.
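The similar-triangle relations invert directly to X = x·(Z+ΔZ)/f and Y = y·(Z+ΔZ)/f. A small sketch, with a hypothetical function name and with x, y, and f assumed to be expressed in the same unit on the image plane:

```python
def back_project(x, y, focal_length, depth):
    """Return (X, Y, Z) in camera coordinates from x/f = X/depth and y/f = Y/depth."""
    return x * depth / focal_length, y * depth / focal_length, depth
```

For example, `back_project(1.0, -2.0, 50.0, 3000.0)` gives `(60.0, -120.0, 3000.0)`. The result is in the camera coordinate system; comparing points seen by different cameras would additionally require each camera's position and orientation.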

From the above, it can be confirmed that ray 605 corresponds to ray 307, and since ray 307 is known to have been within the depth of field during learning, ray 605 is subjected to volume rendering. On the other hand, pixel 403 in FIG. 4 and pixel 703 in FIG. 7 represent the same three-dimensional point, 606 in FIG. 6, and correspond to ray 306 and ray 604 in FIG. 6, respectively. However, ray 306 is known to have been outside the depth of field during learning, and the shooting camera 601 does not capture the corresponding three-dimensional point 606, so ray 604 is excluded from volume rendering. Excluding rays that correspond to three-dimensional points that have not been learned in this way speeds up the volume rendering process for free viewpoint images.

For each ray determined to be within the depth of field in step S503 of FIG. 5, the ray calculation unit 106 calculates the ray in step S504. As described above, a ray for volume rendering is defined as r(t) = o + td. Here, o is the principal point of the camera in the world coordinate system, d is the direction vector of the ray expressed in the world coordinate system, and t is the distance from the camera principal point to the sampling point on the ray. The range of the distance t sampled for volume rendering is limited to within the depths of field of the adjacent shooting cameras 301 and 601, indicated by the hatched areas 305 and 602 in FIG. 6. Limiting the sampled range of t to within the depths of field of the adjacent shooting cameras in this way speeds up the volume rendering process for free viewpoint images.

After the rays are calculated in step S504 of FIG. 5, the neural network unit 107 performs volume rendering in step S505 and calculates the corresponding pixel values (RGB values).

Although the image from the shooting camera (teacher image) and the focus map (distance distribution information) have the same resolution in this embodiment, the present invention is not limited to this, and they may have different resolutions. For example, the distance distribution information may have a lower resolution than the teacher image. A map generated from a disparity map, as a focus map is, involves template matching with a template of a predetermined size for the stereo correspondence search, so the map is usually smaller by a factor of the template size. For example, if the template size is 16 x 16 pixels, the map is typically 1/16 the size of the image both vertically and horizontally. A focus map is also required when rendering a free viewpoint image; if the focus map kept for this purpose is a reduced version scaled to 1/16 both vertically and horizontally, the amount of data required for rendering can be reduced.

FIG. 10 is an explanatory diagram of a teacher image and a low-resolution focus map. It shows an image 401 acquired by the capturing camera 301 and a focus map 1001, which is the accompanying metadata. The resolution of the focus map 1001 is lower than that of the image by the template size used for template matching (16 x 16 pixels), so one pixel of the focus map corresponds to a 16 x 16 pixel region of the image. As explained with reference to FIG. 3, 303 is the main subject and 304 is the background subject.
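The correspondence between image pixels and focus map pixels can be sketched as follows. The function names and the 1920 x 1088 image size are hypothetical; only the 16 x 16 template size comes from the example in the text.

```python
TEMPLATE = 16  # template size in pixels, taken from the example in the text


def focus_map_size(width, height, template=TEMPLATE):
    """Size of the reduced focus map for an image of the given size."""
    return width // template, height // template


def focus_map_index(x, y, template=TEMPLATE):
    """Focus map pixel that covers image pixel (x, y)."""
    return x // template, y // template


map_w, map_h = focus_map_size(1920, 1088)
cell = focus_map_index(31, 16)
```

With a map that is 1/16 of the image size in each direction, the number of stored map values is 1/256 of the number of image pixels.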

Pixel 1003 on the image 401 corresponds to pixel 1002 on the focus map 1001. Pixel 1003 shows the main subject 303, but the 16 x 16 pixel range of the template contains both the main subject 303 and the ground in the background. The focus map pixel value (defocus value) of pixel 1002 may therefore fall between the defocus values of the main subject 303 and the ground. For a subject contour region such as pixel 1003, rays are therefore selected more permissively and the sampling range for volume rendering is left unrestricted. In other words, the ray calculation means identifies the subject contour region based on the distance distribution information, makes rays in that region more likely to be selected, and sets a wide sampling range for them.

In this embodiment, the lens distortion of the capturing camera is assumed to be negligible and no distortion correction is applied to the training images, but training images with distortion may also be used. In that case, distortion correction is applied to the training images so that the rays are calculated correctly. Distortion correction is also applied to the focus map that is referenced together with each image, so that rays are selected correctly. In other words, the acquisition means may process the teacher image and the distance distribution information based on the distortion component of the camera's optical system.

In this embodiment, the pixel values of the focus map stored in the data storage unit 104 are defocus values, but they may instead be parallax values or distance values. Any format is acceptable as long as it can be converted into distance values when the range of volume rendering is determined. When defocus values are stored, focus errors caused by lens decentering or image sensor tilt may be corrected in advance by the capturing camera before the defocus values are recorded as a focus map and stored in the data storage unit 104. In other words, the distance distribution information need only include at least one of a map based on the shift amount representing parallax (parallax map), a map based on the defocus amount (defocus map), and a map based on distance (distance map).
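As a rough illustration of converting map values into distances, the sketch below uses two textbook relations that the description itself does not state: the rectified-stereo relation Z = fB / disparity and the thin-lens equation 1/f = 1/s + 1/v. The function names, the sign convention for defocus (positive means the image forms farther from the lens than the sensor), and the units are all assumptions.

```python
def distance_from_disparity(disparity_px, focal_px, baseline):
    """Assumed rectified-stereo relation: Z = f * B / disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline / disparity_px


def distance_from_defocus(defocus, focal_length, focus_distance):
    """Assumed thin-lens conversion from image-side defocus to subject distance.

    All lengths share one unit (for example millimetres).
    """
    # Image distance at which a subject at focus_distance is sharp.
    v_focus = 1.0 / (1.0 / focal_length - 1.0 / focus_distance)
    # Image distance at which the observed subject would be sharp.
    v = v_focus + defocus
    return 1.0 / (1.0 / focal_length - 1.0 / v)


in_focus = distance_from_defocus(0.0, 50.0, 2000.0)
nearer = distance_from_defocus(0.5, 50.0, 2000.0)
```

A real camera would need its own calibrated conversion; the point is only that any of the three map types can be reduced to a distance before the sampling range is set.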

In this embodiment, as shown for example in FIG. 3, the range of volume rendering is the hatched area 305 defined by the front depth of field and the rear depth of field, but the range is not limited to this. For example, the focus map may be used to determine whether the subject surface lies in front of or behind the in-focus distance, and the range of volume rendering may be restricted further. That is, based on the distance distribution information (the sign of the focus map value), it is determined whether the subject surface is nearer (front) or farther (rear) than the in-focus subject distance, and the sampling density may be changed according to the result. For example, for the subject 303 in FIG. 3, the focus map shows that the subject surface lies in front, so the range of volume rendering may be limited to the front depth of field only.
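The sign-based restriction described above can be sketched as follows. The function name and the convention that a negative defocus value means "in front of the in-focus distance" are assumptions; the description only says that the sign of the focus map value is used.

```python
def restrict_sampling_range(t_near, t_focus, t_far, defocus):
    """Narrow the sampling range [t_near, t_far] using the sign of the defocus value.

    Assumed convention: defocus < 0 means the surface is in front of the
    in-focus distance t_focus, defocus > 0 means it is behind.
    """
    if defocus < 0:
        return (t_near, t_focus)   # front depth of field only
    if defocus > 0:
        return (t_focus, t_far)    # rear depth of field only
    return (t_near, t_far)         # exactly in focus: keep the full range


front_only = restrict_sampling_range(2.0, 3.0, 4.5, -0.2)
```

Halving the range in this way lets the same number of samples cover the remaining interval more densely, or lets fewer samples be used.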

In this embodiment, the volume rendering range is determined with reference to the in-focus subject distance, but this is not a limitation. For example, a predetermined range around the subject surface position may be used as the volume rendering range. That is, the distance to the subject surface may be determined based on the distance distribution information, and the sampling density may be determined with reference to that distance. The depth position of the subject surface may be determined by the method described with reference to FIG. 8.

In this embodiment, only rays within the depth of field are volume rendered when a free-viewpoint image is rendered (step S503), so pixels corresponding to rays outside the depth of field are assigned a fixed pixel value such as black. However, this is not a limitation: pixels outside the depth of field, such as the background, may be rendered by a different method. For example, an omnidirectional (full-sphere) environment texture may be applied to the background, or the background may be rendered from an entirely separate background 3D model.

In this embodiment, the focus map can be evaluated over the entire image, but this is not a limitation. Low-reliability areas in which the focus map cannot be computed may be defined so that no problems arise in subsequent processing. In that case, all target rays are selected for the low-reliability areas. In other words, for an area determined to have low reliability based on the distance distribution information (a low-reliability area), the ray calculation means can select all rays and set a wide sampling range.
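The handling of low-reliability areas can be summarized in a short decision function. This is a sketch under stated assumptions: the confidence value, both thresholds, and the function name are hypothetical, since the description does not say how reliability is measured.

```python
def select_sampling_range(defocus, confidence, dof_tolerance,
                          min_confidence, dof_range, full_range):
    """Return the sampling range for a ray, or None if the ray is skipped.

    confidence, dof_tolerance and min_confidence are hypothetical quantities
    introduced only for this illustration.
    """
    if confidence < min_confidence:
        return full_range          # low-reliability area: always select, wide range
    if abs(defocus) <= dof_tolerance:
        return dof_range           # reliable and within the depth of field
    return None                    # reliable and outside the depth of field


unreliable = select_sampling_range(5.0, 0.1, 1.0, 0.5, (2.0, 4.0), (0.5, 20.0))
```

Falling back to the full range wherever the map cannot be trusted trades some speed for robustness, which matches the intent stated above.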

According to this embodiment, the rays to be volume rendered and their sampling ranges can be limited appropriately, which speeds up the volume rendering process.

(Second Embodiment)
Next, an image processing device according to a second embodiment of the present invention will be described. In this embodiment, the acquisition means further acquires peripheral teacher images of the peripheral space (the space that includes the background subject), and the learning parameter calculation means performs machine learning using both the teacher images and the peripheral teacher images to calculate the learning parameters. The configuration of the image processing device in this embodiment is the same as that of the personal computer 100 described in the first embodiment with reference to FIG. 1. The flowcharts of the 3D model learning operation shown in FIG. 2 and the free-viewpoint image rendering operation shown in FIG. 5, both executed by the control unit 101 in FIG. 1, are also the same as in the first embodiment.

FIG. 11 is an explanatory diagram of teacher image capture and the free-viewpoint camera. It shows a bird's-eye view of the relationship among the subjects, the capture space, the capturing cameras, and the free-viewpoint camera. 1101 and 1105 are capturing cameras, 1103 is the free-viewpoint camera, 302 is the range of the capture space to be 3D modeled, 303 is the main subject, and 304 is the background subject. The capturing camera 1101 has a wide-angle focal length and is focused on the background subject 304; the region within its angle of view and depth of field is indicated by hatched area 1102. The capturing camera 1105 has a standard-angle focal length and is focused on the main subject 303; the region within its angle of view and depth of field is indicated by hatched area 1106.

In this embodiment, multiple capturing cameras are arranged in addition to the capturing cameras 1101 and 1105 in order to capture the subjects within the capture space range 302 from various directions. Specifically, multiple capturing cameras with about the same focal length and shooting distance (focus position) as the capturing camera 1105 are arranged to capture the main subject 303, and multiple capturing cameras with about the same focal length and shooting distance (focus position) as the capturing camera 1101 are arranged to capture subjects other than the main subject 303. For simplicity, FIG. 11 shows only the capturing cameras 1101 and 1105.

FIGS. 12 and 13 are explanatory diagrams of a peripheral teacher image or teacher image and its focus map. They show images 1201 and 1301 acquired by the capturing cameras 1101 and 1105, and focus maps 1202 and 1302, which are the accompanying metadata. As in the first embodiment, the focus maps are calculated using known techniques.

Because the capturing camera 1101 is focused on the background subject 304 at a wide-angle focal length, multiple subjects are captured over a wide area, as shown in image 1201, and the background subject 304 appears as 50% gray, which indicates in-focus, as shown in focus map 1202. The capturing camera 1105, on the other hand, is focused on the main subject 303 at a standard-angle focal length, so the main subject dominates the frame, as shown in image 1301, and the main subject 303 appears as the in-focus 50% gray, as shown in focus map 1302.

With this configuration, when rays for one batch are selected (S204) in the 3D model learning of FIG. 2, the ray 1111 of the standard-angle capturing camera 1105 in FIG. 11, which corresponds to the three-dimensional point 606 on the background subject 304, is outside the depth of field and is not selected. The ray 1103 of the wide-angle capturing camera 1101, on the other hand, is selected as a target ray because the three-dimensional point 606 on the background subject 304 is within its depth of field. As a result, not only the main subject 303 but also the background subject 304 can be included in the 3D model.

In the determination of step S503 in FIG. 5, the ray 1110 of the free-viewpoint camera 1108 in FIG. 11 is selected as a target ray because the corresponding three-dimensional point 607 is found to be captured within the depth of field by the ray 1107 of the adjacent capturing camera 1105. Likewise, the ray 1109 of the free-viewpoint camera 1108 is selected as a target ray because the corresponding three-dimensional point 606 is found to be captured within the depth of field by the ray 1103 of the adjacent capturing camera 1101. In this way, a free-viewpoint image can be rendered not only for the main subject 303 but also for the background subject 304.

In this embodiment, the capturing cameras 1101 and 1105 differ in both focal length and shooting distance so that both the main subject and the background subject can be volume rendered, but only the shooting distance may be made different instead. If the focal length of the capturing camera 1101 for the background subject is lengthened, its angle of view narrows, so more capturing cameras may be needed. If the focal length of the capturing camera 1105 for the main subject is shortened, its angle of view widens, so the camera must move closer to the subject, and the capturing camera 1105 may then appear in the image captured by the capturing camera 1101. On the other hand, using the same focal length simplifies shooting preparation and camera calibration, which makes this option suitable for simple shoots.

In this embodiment, the capturing cameras 1101 and 1105 differ in both focal length and shooting distance so that both the main subject and the background subject can be volume rendered, but the aperture value may be made different instead. That is, the capturing camera 1105 for the main subject is set to a bright aperture value near full aperture, while the capturing camera 1101 for the background subject is stopped down so that the entire capture space range 302 falls within the depth of field. The capturing camera 1105 for the main subject can then use a fast shutter speed and capture the main subject sharply even when it moves, whereas the capturing camera 1101 for the background becomes less tolerant of moving subjects, and its shooting sensitivity may have to be raised, which increases noise. On the other hand, when multiple background subjects are scattered within the capture space range 302, this approach has the advantage that the 3D model can be generated without missing any of them. In this way, the peripheral teacher image need only differ from the teacher image in at least one of focus position (shooting distance), focal length, and aperture value.
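The effect of the aperture value on the depth of field can be checked with the standard hyperfocal-distance approximation. This formula is general photographic optics and is not taken from the description; the function name, the 50 mm focal length, the 3 m focus distance, and the 0.03 mm circle of confusion are assumed example values.

```python
import math


def depth_of_field_limits(focal_length, f_number, focus_distance, coc):
    """Near and far limits of the depth of field (all lengths in millimetres).

    Uses the standard hyperfocal-distance approximation H = f^2 / (N c) + f.
    """
    hyperfocal = focal_length ** 2 / (f_number * coc) + focal_length
    near = (focus_distance * (hyperfocal - focal_length)
            / (hyperfocal + focus_distance - 2.0 * focal_length))
    if focus_distance >= hyperfocal:
        far = math.inf
    else:
        far = (focus_distance * (hyperfocal - focal_length)
               / (hyperfocal - focus_distance))
    return near, far


near_open, far_open = depth_of_field_limits(50.0, 2.0, 3000.0, 0.03)       # near full aperture
near_closed, far_closed = depth_of_field_limits(50.0, 16.0, 3000.0, 0.03)  # stopped down
```

Under these example values the stopped-down camera keeps a much longer interval in focus than the wide-open one, which is the property the background camera relies on.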

In this embodiment, the focus map is also used to speed up volume rendering when free-viewpoint images are rendered, but this is not a limitation. For example, the focus map may be used to speed up 3D model learning, while free-viewpoint image rendering samples the entire capture space range as in the conventional technique. The focus map is then unnecessary at rendering time, so free-viewpoint images can be rendered even with a conventional renderer, which increases the versatility of the renderer.

In this embodiment, the main subject and the background subject are 3D modeled with the same neural network, but separate neural networks may be used for the main subject and the background subject. In other words, the learning parameter calculation means may train different machine learning models for the teacher images and the peripheral teacher images. This can be expected to improve learning efficiency and the reproduction of fine detail in the background subject.

According to this embodiment, rays suited to volume rendering can be selected for the main subject and for the background subject, so that a 3D model including not only the main subject but also the background subject can be generated quickly and free-viewpoint images can be rendered quickly.

(Third Embodiment)
Next, an image processing device according to a third embodiment of the present invention will be described. FIG. 14 is a block diagram of a personal computer (image processing device) 1400. The personal computer 1400 of this embodiment differs from the personal computer 100 described in the first embodiment, which has the neural network unit 107, in that it has a neural network unit 1401. The other components of the personal computer 1400 are the same as those of the personal computer 100, and their description is omitted.

The neural network unit 1401 can handle dynamic scenes (video) in which the subject moves. For such a dynamic scene, points are sampled along each ray, the color and density at each sampled point are computed by the neural network, and volume rendering over the target space determines the color of the pixel corresponding to each ray. For learning and inference, the technique disclosed in Non-Patent Document 2 can be used, for example. During learning, the learning weights are converged by backpropagation, using as the loss function the L2 loss between the color calculated by volume rendering and the color of the captured image. During inference, a free-viewpoint image is rendered by volume rendering along each ray at the free camera position and orientation.

F_Θ: (x, d, z_t) → (c, σ) ... (2)
In equation (2), F_Θ is a neural network consisting of a multilayer perceptron. Its inputs are the three-dimensional coordinate x of a sampled point on the ray, the ray direction vector d, and the latent code z_t of the frame at time t. For each sampling point, F_Θ outputs the RGB color c and the density σ.
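How the per-sample outputs (c, σ) become a pixel color and a training loss can be sketched as follows. The compositing rule is the numerical quadrature commonly used in NeRF-style volume rendering, assumed here because the description defers the details to the non-patent documents; the function names and sample values are hypothetical.

```python
import math


def composite(colors, sigmas, deltas):
    """Accumulate a pixel color from per-sample colors c and densities sigma.

    deltas are the distances between adjacent samples along the ray.
    """
    pixel = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for c, sigma, delta in zip(colors, sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * c[k]
        transmittance *= 1.0 - alpha
    return tuple(pixel)


def l2_loss(rendered, target):
    """Squared error between a rendered pixel and the captured pixel."""
    return sum((r - t) ** 2 for r, t in zip(rendered, target))


# An effectively opaque red sample in front hides the green sample behind it.
pixel = composite([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [1e9, 1e9], [1.0, 1.0])
```

During learning, a loss of this kind would be summed over the rays in a batch and minimized by backpropagation through the network that produced c and σ.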

The neural network unit 107 of the first embodiment represents only a single 3D model. The neural network unit 1401 of this embodiment, by contrast, changes the latent code z_t for each frame and can therefore represent a scene whose 3D shape changes dynamically from frame to frame.

Next, dynamic 3D model learning by the control unit 101 will be described with reference to FIG. 15, which is a flowchart of the dynamic 3D model learning. First, in step S201, the capturing camera position and orientation estimation unit 105 estimates the camera position and orientation corresponding to each image used for learning. Because the positions of the capturing cameras are fixed, the camera positions and orientations are estimated from the images acquired in the first frame.

Next, in step S1501, the control unit 101 determines whether all frames have been processed. If so, this flow ends. If not, the process proceeds to step S1502, in which the control unit 101 generates the latent code corresponding to each frame.

Next, in step S1503, the control unit 101 determines whether the iteration process is complete. In the learning for each frame, the learning weights are updated iteratively so that they converge to the target learning weights. If the iteration process is complete, the process returns to step S1501. If not, the process proceeds to step S1504.

In step S1504, the control unit 101 first randomly selects a batch of rays in each iteration. Non-Patent Document 2 discloses a technique that reduces computational cost efficiently by ray importance sampling, which selects the next rays for learning based on temporal variation in the input video. In this embodiment, ray importance sampling also takes the depth of field of the input video into account to improve computational efficiency further. This operation is described below with reference to FIGS. 16 to 19.

FIG. 16 is an explanatory diagram of teacher image capture. It shows a bird's-eye view of the relationship among the subjects, the capture space, and the capturing camera in the first frame of the input video. 1601 is the capturing camera, 302 is the range of the capture space to be 3D modeled, 303 is the main subject, and 304 is the background subject. The capturing camera 1601 is focused on the main subject 303, and the capture space range 302 is in focus as a whole within the wide angle of view and deep (pan-focus) depth of field indicated by hatched area 1605. Multiple capturing cameras 1601 are arranged to capture the subjects within the capture space range 302 from various directions, but for simplicity FIG. 16 shows only one capturing camera 1601.

In the subsequent frames, the camera parameters of the capturing camera 1601 are changed so that the main subject 303 can be captured in finer detail: the focal length is changed to a standard angle of view, and the aperture is changed to a bright aperture value (F-number) near full aperture. FIG. 17 is an explanatory diagram of teacher image capture. It shows a bird's-eye view of the relationship among the subjects, the capture space, and the capturing camera after the camera parameters are changed. The capturing camera 1601 is focused on the main subject 303, and the capture space range 302 is only partially in focus within the standard angle of view and shallow depth of field indicated by hatched area 1705.

FIG. 18 is an explanatory diagram of a teacher image and its focus map. It shows an image 1801 acquired by the capturing camera 1601 in the first frame of the input video and a focus map 1802, which is the accompanying metadata. FIG. 18 shows the degree of focus on the imaging plane as a grayscale map in which white indicates the near side, black indicates the far side, and 50% gray indicates in-focus. FIG. 19 is an explanatory diagram of a teacher image and its focus map. It shows an image 1901 acquired by the capturing camera 1601 after the camera parameters are changed and a focus map 1902, which is the accompanying metadata.

In the first frame of the input video, the rays 1606 and 1607 in FIG. 16 are candidates for volume rendering and their corresponding three-dimensional points are within the depth of field, so both rays are selected. After the camera parameters are changed, the rays 1706 and 1707 in FIG. 17 are also candidates for volume rendering, but the ray 1706 is excluded because no subject lies within the depth-of-field range along it. Whether a ray is unnecessary can be determined from the focus map 1902 in FIG. 19. The pixel 1904 corresponding to the ray 1707 is an in-focus pixel with 50% gray, whereas the pixel 1903 corresponding to the ray 1706 is a dark gray that lies outside the depth of field, so the ray 1706 is determined to be a ray to exclude.
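The exclusion test based on the grayscale focus map can be sketched as below. The 8-bit encoding with 128 as 50% gray, the tolerance of 16 gray levels, the function names, and the pixel coordinates are assumptions for illustration; the description only states that 50% gray means in-focus and that dark gray lies outside the depth of field.

```python
IN_FOCUS_GRAY = 128  # 50% gray on an assumed 8-bit scale


def is_within_depth_of_field(gray, tolerance=16):
    """True if the focus map value is close enough to the in-focus gray level."""
    return abs(gray - IN_FOCUS_GRAY) <= tolerance


def select_rays(pixels, focus_map, tolerance=16):
    """Keep only the pixels (rays) whose focus map value is within the depth of field."""
    return [p for p in pixels if is_within_depth_of_field(focus_map[p], tolerance)]


# Hypothetical focus map: one in-focus pixel and one dark (far) pixel.
focus_map = {(10, 20): 128, (200, 20): 40}
selected = select_rays([(10, 20), (200, 20)], focus_map)
```

In this example the in-focus pixel plays the role of pixel 1904 and the dark pixel plays the role of pixel 1903, so only the first ray is kept.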

After a batch of rays is selected in step S1504 of FIG. 15, the ray calculation unit 106 calculates the rays in step S1505. As described above, a ray for volume rendering is defined as r(t) = o + td, where o is the principal point of the camera in the world coordinate system, d is the direction vector of the ray expressed in the world coordinate system, and t is the distance from the camera principal point to a sampling point on the ray. The range of the distance t sampled for volume rendering is limited to the depth of field indicated by hatched area 1605 in FIG. 16 and hatched area 1705 in FIG. 17. Performing ray importance sampling with the depth of field taken into account in this way speeds up the volume rendering process.

In this embodiment, the sampling range of t is limited according to the depth of field even in the case of FIG. 16, where the depth of field is deep, but this is not a limitation. In such a case, limiting the sampling range has little effect, so a sufficiently wide fixed sampling range may be set instead.

After the rays are calculated in step S1505 of FIG. 15, the neural network unit 1401 updates the learning weights in step S1506. The learning weights are determined by repeating steps S1503 to S1506 until they converge. When all frames of the input video have been processed, this flow ends (step S1501).
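The control flow of FIG. 15 can be outlined as two nested loops. This is a structural sketch only: the callbacks `select_rays` and `update_weights`, the tuple used as a latent code placeholder, and the fixed iteration count (standing in for the convergence check of step S1503) are all assumptions.

```python
def train_dynamic_model(num_frames, iterations_per_frame, select_rays, update_weights):
    """Outline of the per-frame learning loop; returns the (frame, iteration) pairs visited."""
    visited = []
    for frame in range(num_frames):                # S1501: until all frames are processed
        latent = ("z", frame)                      # S1502: placeholder latent code for the frame
        for it in range(iterations_per_frame):     # S1503: iterate (fixed count assumed)
            rays = select_rays(frame)              # S1504 and S1505: select and calculate rays
            update_weights(rays, latent)           # S1506: update the learning weights
            visited.append((frame, it))
    return visited


seen_latents = []
visited = train_dynamic_model(
    2, 3,
    select_rays=lambda frame: ["ray"],
    update_weights=lambda rays, latent: seen_latents.append(latent),
)
```

A depth-of-field-aware ray selector would be passed in as `select_rays`, so the loop structure itself does not change when the selection rule does.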

In this embodiment, ray importance sampling is performed based on the depth of field, but it may instead be performed based on both the temporal variation of the input video and the depth of field. In many cases, the input video has a higher resolution than the focus map. Observing the temporal variation of the input video therefore allows unnecessary rays to be excluded at a finer resolution, after which the sampling range can be limited based on the focus map, so a further improvement in computational efficiency can be expected.

Next, free-viewpoint image rendering by the control unit 101 will be described with reference to FIG. 20, which is a flowchart of free-viewpoint image rendering in this embodiment. First, in step S501, the free camera position and orientation acquisition unit 108 acquires the camera position and orientation of the free viewpoint to be rendered.

Next, in step S2001, the control unit 101 determines whether all frames have been processed. If so, this flow ends. If not, the process proceeds to step S2002, and pixel values (RGB values) are calculated by volume rendering for all pixels of the rendered image while the latent code z_t is updated for each frame. That is, in step S2002, the control unit 101 generates the latent code corresponding to each frame. Next, in step S2003, the control unit 101 determines whether the pixel values (RGB values) have been calculated by volume rendering for all pixels of the rendered image. If so, the process returns to step S2001. If not, the process proceeds to step S2004.

In step S2004, the ray calculation unit 106 calculates the ray corresponding to each pixel. As described above, the ray to be volume rendered is defined as r(t) = o + td. Here, o is the principal point of the camera in the world coordinate system, d is the direction vector of the ray expressed in the world coordinate system, and t is the distance from the camera principal point to the sampling point on the ray. After the ray has been calculated in step S2004, the neural network unit 107 executes volume rendering processing in step S2005 and calculates the corresponding pixel value (RGB value).
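As a rough illustration of the two steps, the sketch below evaluates r(t) = o + td and composites sampled densities and colors along the ray using the numerical quadrature from Non-Patent Document 1 (NeRF). The densities and colors are passed in directly here, whereas in the embodiment they would come from the neural network unit 107. The function names and the `t_far` bound used for the last interval are assumptions.

```python
import math

def ray_point(o, d, t):
    """Point on the ray r(t) = o + t * d."""
    return tuple(oi + t * di for oi, di in zip(o, d))

def composite(ts, sigmas, colors, t_far):
    """Accumulate RGB along a ray from per-sample densities and colors.

    ts      : increasing sample distances along the ray
    sigmas  : volume density at each sample
    colors  : (r, g, b) at each sample
    t_far   : far bound closing the last interval (assumed parameter)
    """
    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light that reaches the current sample
    for i, (sigma, color) in enumerate(zip(sigmas, colors)):
        t_next = ts[i + 1] if i + 1 < len(ts) else t_far
        delta = t_next - ts[i]
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this interval
        weight = transmittance * alpha
        for c in range(3):
            rgb[c] += weight * color[c]
        transmittance *= 1.0 - alpha
    return tuple(rgb)
```

An empty ray (all densities zero) contributes no color, and a fully opaque first sample hides everything behind it, which is why restricting samples to where surfaces are likely to be saves computation without changing the result much.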

In this embodiment, the focal length and depth of field are changed between the beginning of the input video and the remainder, but the invention is not limited to this. For example, wide-angle, pan-focus shots may be inserted multiple times during shooting at a standard angle of view with a shallow depth of field. This makes it possible to handle gradual changes in the surrounding environment, such as sunlight, over the course of a long input video.

In this embodiment, the focal length and depth of field are changed between the beginning of the input video and the remainder, but the invention is not limited to this. Only the aperture may be controlled so that only the depth of field changes. This makes it possible to shoot with a prime lens that has no zoom mechanism. Alternatively, a plurality of wide-angle prime-lens cameras and standard-angle prime-lens cameras may be arranged. Furthermore, with a zoom lens, the depth of field deepens when the focal length is changed from a standard angle of view to a wide angle even with the aperture kept wide open, so if that is sufficient, only the focal length may be controlled.
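Both effects mentioned above (stopping down the aperture, and moving to a wider focal length with the aperture unchanged) can be checked numerically with the standard thin-lens depth-of-field approximation. This is a sketch for illustration only: the function names, the 0.03 mm permissible circle of confusion, the lens values, and the comparison at a fixed 3 m subject distance are all assumptions, not values from the embodiment.

```python
import math

def depth_of_field(focal_mm, f_number, subject_mm, coc_mm=0.03):
    """Return the (near, far) limits of acceptable focus in mm (thin-lens approximation)."""
    hyperfocal = focal_mm ** 2 / (f_number * coc_mm) + focal_mm
    near = subject_mm * (hyperfocal - focal_mm) / (hyperfocal + subject_mm - 2 * focal_mm)
    if subject_mm >= hyperfocal:
        return near, math.inf  # everything beyond `near` is acceptably sharp
    far = subject_mm * (hyperfocal - focal_mm) / (hyperfocal - subject_mm)
    return near, far

def depth(limits):
    """Total depth of field in mm."""
    near, far = limits
    return far - near

standard_open = depth_of_field(50.0, 2.8, 3000.0)      # standard angle, aperture open
standard_stopped = depth_of_field(50.0, 11.0, 3000.0)  # aperture-only control
wide_open = depth_of_field(24.0, 2.8, 3000.0)          # focal-length-only control
```

With these assumed values, both `standard_stopped` and `wide_open` span a larger range than `standard_open`, which matches the two alternatives described in the paragraph above.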

In this embodiment, the focal length and depth of field are changed between the beginning of the input video and the remainder, but the invention is not limited to this. The focus may instead be shifted gradually from the background to the main subject. This allows not only the main subject but also other subjects to be captured in high definition. Because changing the focus also changes the image magnification, additionally correcting for this change in magnification allows the 3D model to be trained with higher accuracy.

As described above, in this embodiment the teacher image is a video captured at a predetermined frame rate. The learning parameter calculation means calculates, for each frame of the video, a code feature (latent code) that indicates the characteristics of that frame, performs machine learning using the code feature and the teacher image for each frame of the video, and calculates the learning parameters. In this embodiment, the acquisition means may acquire regions of the teacher image that change over time (regions where the change in appearance over time is significant), and the ray calculation means may preferentially select rays in the regions acquired by the acquisition means and output the corresponding rays. Also in this embodiment, the teacher image is a video captured so as to include frames that differ in at least one of focus position, focal length, or aperture value.

According to this embodiment, volume rendering can be performed efficiently even for dynamic scenes in which the subject moves, so that a 3D model can be generated at high speed.

(Other Embodiments)
The present invention can also be realized by supplying a program that realizes one or more of the functions of the above-described embodiments to a system or device via a network or a storage medium, and having one or more processors in a computer of the system or device read and execute the program. The present invention can also be realized by a circuit (e.g., an ASIC) that realizes one or more of the functions.

According to each embodiment, by appropriately limiting the rays and the sampling range used for volume rendering, it is possible to measure the detailed shape of an object at high speed and reconstruct a realistic image. Each embodiment can therefore provide an image processing device, an image processing method, and a program capable of performing volume rendering at high speed.
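One way to picture this limitation of the sampling range is a per-ray sample placement that is dense inside the depth of field and sparse outside it. The sketch below is an assumption-laden illustration rather than the disclosed method: the function names, the use of bin midpoints, and the proportional split of the outside samples between the front and back segments are all hypothetical choices.

```python
def stratified(lo, hi, count):
    """Midpoints of `count` equal-width bins on the interval [lo, hi]."""
    step = (hi - lo) / count
    return [lo + (i + 0.5) * step for i in range(count)]

def sample_depths(t_near, t_far, dof_near, dof_far, n_inside, n_outside):
    """Place samples along a ray, densely inside the depth of field and sparsely outside.

    Assumes t_near <= dof_near < dof_far <= t_far.
    """
    front = dof_near - t_near
    back = t_far - dof_far
    outside = front + back
    n_front = round(n_outside * front / outside) if outside > 0 else 0
    n_back = (n_outside - n_front) if outside > 0 else 0
    ts = []
    if n_front:
        ts += stratified(t_near, dof_near, n_front)
    ts += stratified(dof_near, dof_far, n_inside)
    if n_back:
        ts += stratified(dof_far, t_far, n_back)
    return ts
```

For example, `sample_depths(0.0, 10.0, 4.0, 6.0, 8, 4)` places 8 of its 12 samples in the 2-unit-long in-focus interval and the remaining 4 across the 8 units outside it.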

Preferred embodiments of the present invention have been described above, but the present invention is not limited to these embodiments, and various modifications and changes are possible within the scope of its gist.

100 Personal computer (image processing device)
105 Shooting camera position and orientation estimation unit (acquisition means)
106 Ray calculation unit (ray calculation means)
107 Neural network unit (learning parameter calculation means, rendering means)
108 Free camera position and orientation acquisition unit (acquisition means)

Claims

an acquisition means for acquiring a teacher image and a camera position corresponding to the teacher image;
a ray calculation means for calculating a ray corresponding to each pixel of the teacher image using the position of the camera;
a learning parameter calculation means for performing machine learning by sampling points on the light ray and using the teacher image, and calculating learning parameters;
An image processing device, characterized in that the sampling density of the light rays within the range of the depth of field of the teacher image is higher than the sampling density outside the range of the depth of field.

An acquisition means for acquiring the position of the camera;
a ray calculation means for calculating a ray corresponding to each pixel of the camera using the position of the camera;
a rendering means for rendering an image of the camera by machine learning using learning parameters that have been learned in advance by sampling points on the light ray by a learning parameter calculation means;
An image processing device characterized in that the sampling density of the light rays within the range of the depth of field of the teacher image used in the pre-learning is higher than the sampling density outside the range of the depth of field.

the acquiring means further acquires distance distribution information corresponding to the teacher image,
3. The image processing apparatus according to claim 1, wherein the ray calculation means determines whether the ray is within the depth of field of the teacher image based on the distance distribution information.

It is determined whether the object surface is closer to or further from the object focus distance based on the distance distribution information,
4. The image processing device according to claim 3, wherein the sampling density is changed based on a determination result as to whether the object surface is closer to or further from the object focus distance.

a distance to a surface of the object is determined based on the distance distribution information;
5. The image processing apparatus according to claim 3, wherein the sampling density is determined based on the distance to the surface of the subject.

An image processing device according to any one of claims 3 to 5, characterized in that the distance distribution information has a lower resolution than the teacher image.

7. The image processing device according to claim 3, wherein the ray calculation means identifies a subject contour area based on the distance distribution information, makes it easier to select rays for the subject contour area, and sets a wide sampling range.

An image processing device according to any one of claims 3 to 7, characterized in that the ray calculation means selects all rays for areas determined to have low reliability based on the distance distribution information and sets a wide sampling range.

An image processing device according to any one of claims 3 to 8, characterized in that the acquisition means processes the teacher image and the distance distribution information based on distortion components of the camera's optical system.

An image processing device according to any one of claims 3 to 9, characterized in that the distance distribution information includes at least one of a map based on a shift amount representing parallax, a map based on a defocus amount, or a map based on distance.

The acquisition means further acquires a peripheral teacher image of the peripheral space,
The image processing device according to claim 1, wherein the learning parameter calculation means performs the machine learning by using the teacher image and the peripheral teacher image, and calculates the learning parameters.

The image processing device described in claim 11, characterized in that the peripheral teacher image is an image that differs from the teacher image in at least one of focus position, focal length, and aperture value.

The image processing device described in claim 11 or 12, characterized in that the learning parameter calculation means trains different machine learning models for the teacher image and the peripheral teacher image.

the teacher image is a video captured at a predetermined frame rate,
The learning parameter calculation means
calculating a code feature that indicates a feature of each frame of the video;
The image processing device according to any one of claims 1 to 13, characterized in that machine learning is performed by using the code features and the teacher image for each frame of the video, and the learning parameters are calculated.

the acquiring means acquires a region that changes over time in the teacher image,
15. The image processing apparatus according to claim 14, wherein the ray calculation means preferentially selects rays in the region acquired by the acquisition means, and outputs the corresponding rays.

An image processing device as described in claim 14 or 15, characterized in that the teacher image is a video captured to include frames that differ in at least one of focus position, focal length, or aperture value.

An image processing device according to any one of claims 1 to 16, characterized in that the allowable circle of confusion diameter for determining the depth of field is determined based on the resolution during rendering using the learning parameters.

acquiring a teacher image and a camera position corresponding to the teacher image;
calculating a ray corresponding to each pixel of the teacher image using the position of the camera;
performing machine learning by sampling points on the ray and using the teacher image to calculate learning parameters;
An image processing method, characterized in that the sampling density of the light rays within the depth of field range of the teacher image is higher than the sampling density outside the depth of field range.

obtaining a camera position;
calculating a ray corresponding to each pixel of the camera using the position of the camera;
and rendering the image of the camera by machine learning using learning parameters pre-trained by sampling points on the light ray by a learning parameter calculation means;
An image processing method characterized in that the sampling density of the light rays within the depth of field range of the teacher image used in the pre-learning is higher than the sampling density outside the depth of field range.

A program causing a computer to execute the image processing method described in claim 18 or 19.