
WO2025120714A1 - Content server, client terminal, image display system, display data transmission method, and display image generation method - Google Patents


Info

Publication number
WO2025120714A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
scene information
display
information
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/043347
Other languages
French (fr)
Japanese (ja)
Inventor
新宇 張
徳秀 金子
和之 有松
Current Assignee
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc filed Critical Sony Interactive Entertainment Inc
Priority to PCT/JP2023/043347
Publication of WO2025120714A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/52 Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A63F13/525 Changing parameters of virtual cameras
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/131 Protocols for games, networked simulations or virtual reality

Definitions

  • This invention relates to a content server, a client terminal, an image display system, a display data transmission method, and a display image generation method that display images of a three-dimensional display world.
  • electronic content in which video images are generated in real time in response to user operations and distributed from a server can utilize the server's abundant processing environment, making it easier to display high-quality images while minimizing the impact of the client terminal's processing performance.
  • however, the constant transmission of operation information from the client terminal, and the data transmission processing by the server that receives it, can cause the display to lag behind viewpoint operations or frames to be lost through packet loss, degrading the quality of the user experience.
  • the present invention was made in consideration of these problems, and its purpose is to provide a technology that reduces the impact of distribution on the quality of the user experience when processing images of content that is distributed from a server.
  • This content server is characterized by having a learning image generation unit that generates learning images that represent the state of each scene in a three-dimensional displayed world, where the situation changes in response to user operations, as viewed from multiple viewpoints; a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and a 3D scene information transmission unit that transmits data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.
  • This client terminal is characterized by having an input information acquisition unit that acquires information on user operations and information on a display viewpoint for a three-dimensional display world, a 3D scene information acquisition unit that acquires from a server data on multiple types of 3D scene information that represent three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to user operations, and an image generation unit that draws at least a portion of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • This image display system includes a client terminal that displays an image of a three-dimensional display world in which the situation changes according to user operations, and a content server that transmits data used to generate the display image.
  • the content server includes a learning image generation unit that generates images showing how each scene in the display world looks from multiple viewpoints as learning images, a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene by machine learning using the learning images as teacher data, and a 3D scene information transmission unit that transmits the multiple types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed.
  • the client terminal includes an input information acquisition unit that acquires information on user operations and information on a display viewpoint for the display world, a 3D scene information acquisition unit that acquires data on the multiple types of 3D scene information from the content server, and an image generation unit that draws at least a part of the frame of the display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • This display data transmission method is characterized by including the steps of: generating, as learning images, images that represent how each scene in a three-dimensional display world, in which the situation changes in response to user operations, is viewed from multiple viewpoints; acquiring multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and transmitting data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.
  • Another aspect of the present invention relates to a display image generating method, which is characterized by including the steps of: acquiring information on user operations and information on a display viewpoint for a three-dimensional display world; acquiring from a server data on multiple types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to the user operations; and drawing at least a part of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • the impact of distribution on the quality of the user experience can be reduced.
  • FIG. 1 is a diagram showing a configuration example of an image display system to which the present embodiment can be applied;
  • FIG. 2 is a diagram illustrating an internal circuit configuration of a client terminal according to the present embodiment.
  • FIG. 3 is a diagram showing a basic flow of image processing according to the present embodiment in comparison with the prior art.
  • FIG. 4 is a diagram showing a configuration of functional blocks of a client terminal and a content server according to the present embodiment.
  • FIG. 5 is a diagram illustrating a procedure in which a content server acquires 3D scene information in the present embodiment.
  • FIG. 6 is a diagram for explaining how 3D scene information of different ranges is acquired in the present embodiment.
  • FIG. 7 is a schematic diagram of data transition during transmission and reception of 3D scene information in the present embodiment.
  • FIG. 8 is a diagram for explaining an image correction process performed by an image generation unit in the present embodiment.
  • FIG. 1 shows an example of the configuration of an image display system to which this embodiment can be applied.
  • the image display system 1 includes client terminals 10a, 10b, 10c that display images in response to user operations, and a content server 20 that provides data used for display.
  • Input devices 14a, 14b, 14c for user operations and display devices 16a, 16b, 16c that display images are connected to the client terminals 10a, 10b, 10c, respectively.
  • the client terminals 10a, 10b, 10c and the content server 20 can establish communication via a network 8 such as a WAN (Wide Area Network) or a LAN (Local Area Network).
  • the client terminals 10a, 10b, 10c may be connected to the display devices 16a, 16b, 16c and the input devices 14a, 14b, 14c either by wire or wirelessly. Alternatively, two or more of these devices may be formed integrally.
  • the client terminal 10b is connected to a head-mounted display, which is the display device 16b.
  • the head-mounted display can change the field of view of the displayed image according to the movement of the user wearing it on the head, so it also functions as the input device 14b.
  • the client terminal 10c is a portable terminal that is integrated with the display device 16c and the input device 14c, which is a touchpad that covers the screen of the display device 16c. In this way, there are no limitations on the external shape or connection form of the illustrated devices. There are also no limitations on the number of client terminals 10a, 10b, 10c and content servers 20 that are connected to the network 8. Hereinafter, the client terminals 10a, 10b, 10c will be collectively referred to as client terminals 10, the input devices 14a, 14b, 14c as input device 14, and the display devices 16a, 16b, 16c as display device 16.
  • the input device 14 may be any one or a combination of general input devices such as a controller, keyboard, mouse, touchpad, joystick, or various sensors such as a motion sensor or camera equipped in a head mounted display, and supplies the contents of user operations to the client terminal 10.
  • the display device 16 may be any general display such as a liquid crystal display, plasma display, organic EL display, wearable display, or projector, and displays images output from the client terminal 10.
  • the content server 20 provides data of content accompanied by image display to the client terminal 10.
  • the type of content is not particularly limited, and may be any of electronic games, decorative images, web pages, video chat using avatars, etc.
  • the content server 20 sequentially obtains information on user operations on the input device 14 from the client terminal 10, reflects this information in the world to be displayed, and transmits the necessary data so that an image representing this is displayed on the client terminal 10 side.
  • FIG. 2 shows the internal circuit configuration of the client terminal 10.
  • the client terminal 10 includes a CPU (Central Processing Unit) 122, a GPU (Graphics Processing Unit) 124, and a main memory 126. These components are interconnected via a bus 130. An input/output interface 128 is also connected to the bus 130.
  • Connected to the input/output interface 128 are: a communication unit 132 consisting of a peripheral device interface such as USB or a network interface for a wired or wireless LAN; a storage unit 134 such as a hard disk drive or non-volatile memory; an output unit 136 that outputs data to the display device 16; an input unit 138 that inputs data from the input device 14; and a recording medium drive unit 140 that drives a removable recording medium such as a magnetic disk, optical disk, or semiconductor memory.
  • the CPU 122 executes an operating system stored in the storage unit 134 to control the entire client terminal 10.
  • the CPU 122 also executes various programs that have been read from a removable recording medium and loaded into the main memory 126, or downloaded via the communication unit 132.
  • the GPU 124 performs drawing processing in accordance with drawing commands from the CPU 122, and stores the display image in a frame buffer (not shown).
  • the GPU 124 then converts the display image stored in the frame buffer into a video signal and outputs it to the output unit 136.
  • the main memory 126 is composed of RAM (Random Access Memory), and stores programs and data necessary for processing.
  • the content server 20 may also have a similar internal circuit configuration.
  • Figure 3 shows the basic flow of image processing in this embodiment in comparison with the prior art.
  • the main display target is a three-dimensional world in which various objects exist.
  • the state of the world changes according to the provisions of the program or the user's operation.
  • the content server constantly acquires information on the content of the user's operation, the position of the viewpoint relative to the displayed world, and the direction of the line of sight.
  • the entire three-dimensional space to be displayed is called the "display world".
  • the state of the displayed world within or near the display field of view is called the "scene".
  • the position of the viewpoint and the direction of the line of sight relative to the scene may also be collectively referred to simply as the "viewpoint".
  • the viewpoint may be manually operated by the user via the input device 14, or may be derived from the movement of the user's head using a motion sensor provided in the head-mounted display.
  • the content server draws the display image 200 in a field of view corresponding to the viewpoint information while changing the scene in response to user operations.
  • the content server generates the display image 200 using well-known computer graphics drawing techniques such as ray tracing and rasterization.
  • the content server transmits the generated display image 200 to the client terminal, which displays it on a display device. By repeating the illustrated process at a specified frame rate, a moving image showing the change in the scene in response to user operations, etc. is displayed on the client terminal side.
  • the content server 20 also acquires information on the content of user operations, the position of the viewpoint relative to the displayed world, and the direction of the gaze as needed, and similarly renders images while changing the scene accordingly. In this embodiment, however, the content server 20 uses the rendered images as learning images 202, that is, as training data for machine learning. The content server 20 collects the learning images 202 and performs machine learning to generate 3D scene information 204 that represents three-dimensional information about the scene.
  • As a representative example of such machine learning, NeRF (Neural Radiance Fields) trains a multilayer perceptron (MLP) as the 3D scene information. This data is a neural network that takes as input five-dimensional parameters consisting of position coordinates (x, y, z) and a direction vector d(θ, φ) in three-dimensional space, and outputs a volume density σ and color information c (RGB) of the three primary colors.
  • the data of this neural network is called "3D scene information.”
  • various improvement methods have been proposed for NeRF, and the specific method is not particularly limited in this embodiment. Furthermore, it is not intended to limit the machine learning method to NeRF.
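For illustration, the following Python/NumPy sketch (an assumption for explanatory purposes, not the patent's implementation; `scene_mlp` and the layer sizes are hypothetical) shows the shape of such a network: a five-dimensional input (x, y, z, θ, φ) mapped to a volume density σ and a color c:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP: 5-D input (x, y, z, theta, phi) -> (sigma, r, g, b).
W1 = rng.normal(0.0, 0.5, (5, 32))
b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.5, (32, 4))
b2 = np.zeros(4)

def scene_mlp(xyz, direction):
    """Evaluate the radiance field at one point and view direction."""
    p = np.concatenate([xyz, direction])   # 5-D input vector
    h = np.maximum(W1.T @ p + b1, 0.0)     # hidden layer, ReLU
    out = W2.T @ h + b2
    sigma = np.log1p(np.exp(out[0]))       # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))   # sigmoid keeps color in [0, 1]
    return sigma, rgb

sigma, rgb = scene_mlp(np.array([0.1, 0.2, 0.3]), np.array([0.0, 1.0]))
```

A real NeRF uses many more layers and a positional encoding of the input, but the input/output contract is the one described above.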
  • the content represented by the learning images 202, and therefore the 3D scene information 204, can change from moment to moment.
  • the figure shows the situation in which 3D scene information 204 is generated for a scene at a certain time, or within an infinitesimally short period that can be regarded as a single time.
  • hereinafter, such a scene at a certain time, or within an infinitesimally short period that can be regarded as a single time, may be expressed as "one scene". Note that if there is no movement in the displayed world, it can be treated as "one scene" regardless of time.
  • the content server 20 updates the 3D scene information 204 to correspond to the next scene.
  • the generation and updating of 3D scene information may be collectively referred to as "obtaining" 3D scene information.
  • the content server 20 may collect the learning images 202, for example, in the following manner. (1) Generate viewpoints suitable for learning around the viewpoint that defines the field of view of the image that is actually displayed, and generate corresponding images. (2) Reuse images displayed on the devices of multiple users viewing the same scene.
  • the viewpoint that defines the actual display field of view in the client terminal 10 is called the "display viewpoint” to be distinguished from the "learning viewpoint” that is set when generating a learning image.
  • the content server 20 may implement only one of (1) and (2), or may implement both. For example, a viewpoint that is missing due to (2) may be supplemented by (1).
  • the content server 20 transmits 3D scene information 204, which is the result of the learning, to the client terminal 10.
  • the client terminal 10 generates a display image 206 using the transmitted 3D scene information 204.
  • the client terminal 10 can display the scene as it is viewed from any viewpoint with high quality, with a relatively low load.
  • when NeRF is applied, the client terminal 10 generates a ray r that passes through each pixel on the view screen from the display viewpoint, and uses volume rendering to integrate the color along that direction to determine the pixel value C(r) of the display image as follows:

    C(r) = ∫[tn→tf] T(t) σ(r(t)) c(r(t), d) dt,  where  T(t) = exp( −∫[tn→t] σ(r(s)) ds )
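In practice, the volume rendering used here is evaluated by sampling points along each ray. The following NumPy sketch (sample values are made up for illustration) computes the standard discrete approximation C(r) ≈ Σ T_i (1 − exp(−σ_i δ_i)) c_i:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray.
    sigmas: (N,) densities; colors: (N, 3) RGB; deltas: (N,) segment lengths."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance T_i
    weights = trans * alphas
    pixel = (weights[:, None] * colors).sum(axis=0)
    return pixel, weights

# Example: an almost opaque red sample near the camera hides what is behind it.
sigmas = np.array([50.0, 1.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.array([0.5, 0.5, 0.5])
pixel, weights = render_ray(sigmas, colors, deltas)
```

In the example, the first sample is nearly opaque, so its color dominates the pixel and occludes the samples behind it.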
  • the content server 20 repeats the process shown in the figure for the movement of the scene, updating the 3D scene information 204 at a predetermined rate and sequentially transmitting it to the client terminal 10.
  • the client terminal 10 generates a display image 206 while updating the 3D scene information it uses, thereby making it possible to represent moving images from any viewpoint that have the same changes as the learning images 202 generated by the content server 20.
  • the client terminal 10 can display an image with little delay in response to changes in viewpoint by drawing a display image 206 that shows the scene as seen from the viewpoint immediately before display using the latest 3D scene information 204.
  • FIG. 4 shows the functional block configuration of the client terminal 10 and the content server 20 in this embodiment.
  • the functions of the components in the illustrated functional blocks may be realized in circuitry or processing circuitry including general purpose processors, application specific processors, integrated circuits, ASICs (Application Specific Integrated Circuits), a CPU (Central Processing Unit), conventional circuits, and/or combinations thereof, configured or programmed to realize the functions described herein.
  • a processor is considered to be a circuitry or processing circuitry including transistors and other circuits.
  • a processor may be a programmed processor that executes programs stored in memory.
  • a circuit, unit, or means is hardware that is programmed to realize, or that executes, a described function.
  • the hardware may be any hardware disclosed in this specification or any hardware that is programmed to realize or known to execute the described function. If the hardware is a processor, which is considered to be a type of circuit, the circuit, means, or unit is a combination of hardware and software used to configure the hardware and/or processor.
  • the client terminal 10 includes an input information acquisition unit 50 that acquires input information such as user operations, a 3D scene information acquisition unit 52 that acquires 3D scene information data from the content server 20, a 3D scene information storage unit 54 that stores the acquired 3D scene information data, an image generation unit 56 that generates a display image, and an output unit 58 that outputs the display image data.
  • the input information acquisition unit 50 acquires the contents of user operations from the input device 14 at any time. User operations include the selection and activation of content, and command input for content currently being executed.
  • the input information acquisition unit 50 also acquires information on the display viewpoint for the displayed world from the input device 14 or head-mounted display at any time or at a specified time interval.
  • Information for detecting the position and posture of the head of a user wearing a head-mounted display and acquiring viewpoint information based on this is well known, and this may also be applied in this embodiment.
  • the input information acquisition unit 50 supplies the acquired information to the content server 20 and the image generation unit 56 as appropriate.
  • the 3D scene information acquisition unit 52 sequentially acquires data of 3D scene information, which is continuously updated, from the content server 20.
  • the 3D scene information acquisition unit 52 acquires multiple types of 3D scene information representing one scene from the content server 20.
  • the 3D scene information storage unit 54 stores the data of the multiple types of 3D scene information acquired by the 3D scene information acquisition unit 52.
  • the 3D scene information acquisition unit 52 acquires new 3D scene information, it updates the data of the same type of 3D scene information stored in the 3D scene information storage unit 54.
  • the image generation unit 56 uses the 3D scene information data most recently stored in the 3D scene information storage unit 54 to draw a display image at a predetermined frame rate.
  • the image generation unit 56 acquires the latest display viewpoint from the input information acquisition unit 50, and draws an image in the corresponding field of view using a technique such as the volume rendering described above. If 3D scene information corresponding to the transition of the scene can be prepared using machine learning, then even if the client terminal 10 generates the display image, it is possible to draw a high-quality image with a lighter load than with normal processing such as ray tracing.
  • the image generation unit 56 changes the display image representing one scene in time, space, or both by drawing using multiple types of 3D scene information. For example, in a mode in which the content server 20 transmits 3D scene information data in ascending order of information density, the image generation unit 56 updates the display image using the transmitted 3D scene information in sequence. This allows an image representing one scene to be displayed with low latency, and the visual image quality can be maintained by gradually increasing the resolution.
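A minimal sketch of this client-side bookkeeping (hypothetical class and method names; the patent does not prescribe an implementation) keeps only the most recent data of each type and draws with the highest-density information available for the newest scene:

```python
# Hypothetical client-side store for multiple types of 3D scene information,
# where a larger type id means higher information density (and later arrival).
class SceneInfoStore:
    def __init__(self):
        self.latest = {}   # type id -> (scene_time, payload)

    def update(self, info_type, scene_time, payload):
        """Replace stored data of the same type with newer data (cf. unit 54)."""
        cur = self.latest.get(info_type)
        if cur is None or scene_time >= cur[0]:
            self.latest[info_type] = (scene_time, payload)

    def best_for_drawing(self):
        """Among entries for the newest scene, pick the highest-density one."""
        if not self.latest:
            return None
        newest = max(t for t, _ in self.latest.values())
        candidates = {k: v for k, v in self.latest.items() if v[0] == newest}
        return candidates[max(candidates)]

store = SceneInfoStore()
store.update(0, scene_time=1.0, payload="low-density net, scene t=1")
store.update(1, scene_time=1.0, payload="high-density net, scene t=1")
store.update(0, scene_time=2.0, payload="low-density net, scene t=2")
```

With these updates, drawing uses the low-density network for the newest scene until its higher-density counterpart arrives, matching the progressive-quality behavior described above.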
  • the output unit 58 outputs the image drawn by the image generation unit 56 to the display device 16 at a predetermined rate for display.
  • the content server 20 includes an input information acquisition unit 70 that acquires input information from the client terminal 10, a learning viewpoint generation unit 72 that generates a learning viewpoint, a display world control unit 74 that controls the display world, a 3D model storage unit 76 that stores 3D models of objects, a learning image generation unit 78 that generates learning images, a type-specific 3D scene information acquisition unit 80 that acquires data on multiple types of 3D scene information, a 3D scene information storage unit 84 that stores the acquired data on the 3D scene information, and a 3D scene information transmission unit 86 that transmits the data on the 3D scene information to the client terminal 10.
  • the input information acquisition unit 70 acquires the contents of user operations and viewpoint information from the client terminal 10 at any time or at a specified time interval.
  • the learning viewpoint generation unit 72 generates multiple learning viewpoints for generating learning images.
  • the learning viewpoint generation unit 72 generates learning viewpoints around the latest display viewpoint acquired by the input information acquisition unit 70 according to a specified rule. For example, the learning viewpoint generation unit 72 places a specified number of learning viewpoints evenly inside a sphere of a specified radius centered on the latest display viewpoint.
  • the specified radius is, for example, a value obtained by multiplying the maximum expected speed of viewpoint movement by the longest time until the learned data is used for display.
  • the learning viewpoint generating unit 72 is not limited to distributing the learning viewpoints evenly, but may distribute more learning viewpoints in a range in which the viewpoint is expected to move, depending on the situation in the displayed world, etc.
  • the learning viewpoint generating unit 72 may also set lines of sight evenly in a predetermined number of directions for the viewpoint at each position, or may set more lines of sight in directions to which the line of sight is expected to move.
  • the learning viewpoint generating unit 72 may also use the latest display viewpoint itself as a learning viewpoint.
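The viewpoint placement described above can be sketched as follows (a Python/NumPy illustration under the stated rule; function and parameter names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_learning_viewpoints(center, v_max, t_max, n):
    """Place n learning viewpoints uniformly inside a sphere around the
    latest display viewpoint. Radius = maximum expected viewpoint speed
    multiplied by the longest time until the learned data is displayed."""
    radius = v_max * t_max
    points = []
    while len(points) < n:                        # rejection sampling in a cube
        cand = rng.uniform(-radius, radius, 3)
        if np.linalg.norm(cand) <= radius:
            points.append(center + cand)
    return np.array(points), radius

center = np.array([0.0, 1.5, -2.0])
viewpoints, radius = sample_learning_viewpoints(center, v_max=2.0, t_max=0.25, n=16)
```

A non-uniform variant would simply bias the sampling distribution toward the range in which the viewpoint is expected to move.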
  • the display world control unit 74 controls the three-dimensional display world represented as content according to the content of the user operations acquired by the input information acquisition unit 70. For example, if the content is an electronic game, the display world control unit 74 places necessary objects such as user characters in the virtual space where the electronic game takes place, and gives them movement according to commands entered by the user and program specifications.
  • the three-dimensional model storage unit 76 stores three-dimensional models of objects that exist in the display world, and the display world control unit 74 reads them out as appropriate to use in constructing the display world.
  • the learning image generation unit 78 generates images of the scene as viewed from multiple learning viewpoints generated by the learning viewpoint generation unit 72 as learning images.
  • the learning image generation unit 78 preferably generates learning images using a technique capable of rendering high-quality images, such as ray tracing.
  • the type-specific 3D scene information acquisition unit 80 uses the learning images generated by the learning image generation unit 78 to generate 3D scene information through machine learning as described above, and updates the information to correspond to changes in the scene.
  • the type-specific 3D scene information acquisition unit 80 acquires multiple types of 3D scene information representing one scene.
  • the type-specific 3D scene information acquisition unit 80 acquires multiple pieces of 3D scene information in which the spatial density of the represented information differs from one another.
  • the spatial density of the information held by the 3D scene information is referred to as "information density.”
  • Information density may also be referred to as the resolution of the information held by the 3D scene information, or the spatial frequency of the information.
  • the vectors input to a neural network during training are converted into vectors in a high-dimensional space that includes high frequencies using Positional Encoding, which allows the high-frequency components of the output vector to be represented more accurately.
  • the function γ used for the conversion is expressed as follows, where the parameter L determines the highest frequency and γ is applied to each component p of the input vector:

    γ(p) = ( sin(2^0·πp), cos(2^0·πp), sin(2^1·πp), cos(2^1·πp), …, sin(2^(L−1)·πp), cos(2^(L−1)·πp) )
  • the type-specific 3D scene information acquisition unit 80 acquires multiple pieces of 3D scene information with different information densities in parallel by setting multiple parameters L and performing machine learning individually.
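Assuming the standard NeRF positional encoding, a short Python sketch shows how the parameter L controls the information density of the encoded input (function name is hypothetical):

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar component of p to 2L sinusoids; larger L lets the
    network represent higher-frequency (denser) detail."""
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(L)) * np.pi     # pi, 2*pi, 4*pi, ...
    args = p[..., None] * freqs               # shape (..., L)
    enc = np.stack([np.sin(args), np.cos(args)], axis=-1)
    return enc.reshape(-1)                    # flat vector of length dim * 2L

x = np.array([0.1, 0.2, 0.3])
low = positional_encoding(x, L=4)    # input for a coarse (low-density) network
high = positional_encoding(x, L=10)  # input for a fine (high-density) network
```

Training one network per value of L, as the text describes, yields multiple pieces of 3D scene information whose learning completes sooner for smaller L.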
  • the means for controlling the information density of the 3D scene information is not limited to this.
  • another example is Multiresolution Hash Encoding, which sets grids with multiple resolutions and expresses an input vector based on its positional relationship with their vertices (see, for example, Thomas Muller et al., "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding," ACM Transactions on Graphics, July 2022, Vol. 41, No. 4, Article 102, pp. 1-15).
  • the type-specific 3D scene information acquisition unit 80 can acquire multiple pieces of 3D scene information with different information densities in parallel by setting multiple levels L corresponding to the number of grid resolutions and performing machine learning individually. In these cases, multiple values to be set as the parameter L are prepared in the internal memory of the type-specific 3D scene information acquisition unit 80.
  • the type-specific 3D scene information acquisition unit 80 then sequentially uses multiple learning images generated for one scene by the learning image generation unit 78 to acquire multiple pieces of 3D scene information with different information densities in parallel, and stores them in the 3D scene information storage unit 84 or updates the same type of 3D scene information that has already been stored.
  • the type-specific 3D scene information acquisition unit 80 may set the level number L to only one and learn, and when transmitting to the client terminal 10, etc., select grids of different levels (e.g., grids of levels 0 to 1, grids of levels 0 to 2, ..., grids of levels 0 to L-1, etc.) individually to read out data. In this case, too, the learning is completed more quickly for lower level grids, i.e., grids with lower information density, and the effects described below are the same.
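This level-wise readout can be illustrated as follows (a toy sketch with made-up grid shapes; the actual table layout of Multiresolution Hash Encoding is more involved):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-level parameter tables: finer levels hold more parameters
# and, as the text notes, finish learning later.
grids = {level: rng.normal(size=(2 ** (4 + level), 2)) for level in range(4)}

def read_out_levels(grids, max_level):
    """Select grids of levels 0..max_level for transmission
    (levels 0-1, levels 0-2, ..., levels 0-(L-1))."""
    return {lv: grids[lv] for lv in range(max_level + 1)}

coarse = read_out_levels(grids, 1)   # low information density, ready first
fine = read_out_levels(grids, 3)     # full information density
```

Because each readout reuses the lower levels, the coarse selection is strictly smaller than the fine one, which is what makes it cheap to transmit first.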
  • the types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 are not limited to distinctions in information density.
  • the type-specific 3D scene information acquisition unit 80 may acquire multiple pieces of 3D scene information with different areas or objects to be learned in the display world. In this case, the smaller the range to be learned in the display world, the faster the learning is completed, and therefore the faster the 3D scene information is updated.
  • the type-specific 3D scene information acquisition unit 80 records the time of the reflected scene in association with each of the multiple types of 3D scene information stored in the 3D scene information storage unit 84. Note that the type-specific 3D scene information acquisition unit 80 may acquire multiple types of 3D scene information with different combinations of information density and range of the learning target.
  • the 3D scene information transmission unit 86 transmits the latest 3D scene information data stored in the 3D scene information storage unit 84 to the client terminal 10.
  • the 3D scene information transmission unit 86 transmits to the client terminal 10 3D scene information in order of completion of learning for one scene. For example, when data of 3D scene information with different information densities is to be transmitted, learning is completed in order from the 3D scene information with the lowest information density, as described above. Therefore, the 3D scene information transmission unit 86 transmits data of the 3D scene information with the lowest information density when learning of that information is completed, and transmits data of the 3D scene information with the next lowest information density when learning of that information is completed, and so on, gradually transmitting up to the 3D scene information with the highest information density.
  • the 3D scene information transmission unit 86 includes a division unit 88 inside.
  • the division unit 88 divides each of the multiple types of 3D scene information representing one scene into multiple data.
  • Each type of 3D scene information representing one scene is composed of an individual neural network.
  • the division unit 88 divides each neural network into multiple neural networks.
  • Here, division means generating multiple networks in which mutually different subsets of nodes are removed, while maintaining the structure of the remaining nodes associated by a hash table or the like. The nodes to be removed from each divided neural network are determined randomly.
  • To alleviate overfitting, a technique called dropout is known, in which some of the nodes in a neural network are randomly deactivated (see, for example, Nitish Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, June 2014, Vol. 15, pp. 1929-1958). As dropout demonstrates, the accuracy of a neural network's learning results can be maintained at a certain level even if some of its nodes are deactivated.
  • the client terminal 10 can generate a display image with relatively high accuracy even when using 3D scene information with some of the nodes removed. Therefore, the content server 20 can transmit multiple neural networks representing the same 3D scene information with some of the nodes removed, thereby suppressing an increase in data size and increasing robustness against packet loss. If there is no packet loss, the image generation unit 56 of the client terminal 10 can restore the neural network before division and draw the display image. If there is packet loss, the image generation unit 56 draws the display image using the neural network that it has acquired. As described above, it is possible to generate a display image in this case as well.
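A minimal sketch of this division, assuming each divided copy zeroes out a complementary subset of hidden-node weights (the layer sizes and split ratio are illustrative, not the patent's implementation):

```python
import numpy as np

def split_network(weights, rng):
    """Split one layer's node weights into two copies with mutually
    different nodes removed (zeroed). Summing the copies restores the
    original exactly; either copy alone is a degraded but usable network."""
    n_nodes = weights.shape[0]
    keep_a = rng.random(n_nodes) < 0.5          # random node assignment
    part_a = np.where(keep_a[:, None], weights, 0.0)
    part_b = np.where(~keep_a[:, None], weights, 0.0)
    return part_a, part_b

rng = np.random.default_rng(42)
w = rng.standard_normal((6, 4))   # 6 hidden nodes, 4 weights each
a, b = split_network(w, rng)
restored = a + b                  # no packet loss: exact reconstruction
```

If one of the two copies is lost in transit, the receiver can still evaluate the surviving copy, which is the robustness property the text attributes to dropout.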
  • the image generation unit 56 of the client terminal 10 updates the display image using the newly acquired 3D scene information. Since the resolution of the display image corresponds to the information density of the 3D scene information, the resolution of the display image gradually increases by acquiring 3D scene information in order starting with the 3D scene information with the lowest information density.
  • 3D scene information with low information density has a smaller number of samples in volume rendering, and therefore the drawing speed of the display image increases.
  • the time from the start of learning each scene to its display can be significantly shortened.
  • high-definition images can be displayed using high-information-density 3D scene information, making it possible to achieve low-latency display while minimizing the impact on appearance.
  • the types of 3D scene information are not limited to distinctions in resolution.
  • the content server 20 may learn only the partial area that the user is gazing at in a short time and transmit it first, and then transmit the 3D scene information for the entire scene later. In this case as well, a similar principle can be used to display an image in which the area being gazed at is displayed with low latency and updated to track the surrounding image.
  • the 3D scene information transmitting unit 86 may transmit 3D scene information representing one scene to the client terminal 10 using different communication protocols depending on the type of 3D scene information.
  • the 3D scene information transmitting unit 86 may transmit 3D scene information for an area where image detail should be prioritized using the highly reliable TCP/IP (Transmission Control Protocol/Internet Protocol), and transmit 3D scene information for an area where low latency in movement should be prioritized using UDP (User Datagram Protocol), which has a high transfer rate.
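The protocol selection described above can be sketched as a simple dispatch; the region names, the `priority` field, and the two-way split are assumptions for illustration:

```python
# Hypothetical mapping from scene-information regions to transports,
# following the priority rule described above.
def choose_transport(region):
    """Detail-priority regions go over reliable TCP; latency-priority
    regions go over UDP, trading reliability for transfer rate."""
    return "tcp" if region["priority"] == "detail" else "udp"

regions = [
    {"name": "gaze_area", "priority": "detail"},
    {"name": "moving_background", "priority": "latency"},
]
plan = {r["name"]: choose_transport(r) for r in regions}
```

In practice the choice could also depend on measured loss rates or the division redundancy described earlier, since divided networks already tolerate some packet loss.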
  • FIG. 5 shows a schematic diagram of the procedure by which the content server 20 acquires 3D scene information.
  • the display world control unit 74 constructs a display world 210 in which, for example, an enemy character 212 exists.
  • the display world 210 may be in motion, but here, the display world 210 corresponding to one scene is shown.
  • the learning viewpoint generation unit 72 generates multiple learning viewpoints based on the latest display viewpoint, etc.
  • the learning image generation unit 78 generates learning images 214a, 214b, 214c, etc. corresponding to each learning viewpoint. Note that there is no limit to the number of learning images to be generated.
  • the type-specific 3D scene information acquisition unit 80 performs machine learning using the learning images 214a, 214b, 214c, etc., and updates multiple types of 3D scene information.
  • the substance of each piece of 3D scene information is a neural network 216a, 216b, 216c, ....
  • the neural networks 216a, 216b, 216c, ... differ in at least one of the information density and the range they represent.
  • the type-specific 3D scene information acquisition unit 80 performs learning, for example, by setting the above-mentioned parameter L for each. In this case, the lower the information density of the 3D scene information, the quicker learning is completed.
  • the type-specific 3D scene information acquisition unit 80 performs learning by, for example, cutting out the corresponding area from the learning images 214a, 214b, 214c.
  • the narrower the range represented by the 3D scene information, the sooner the learning is completed.
  • the time until learning is completed varies depending on the balance. Qualitatively, the higher the information density and the wider the range represented, the slower the learning is completed, so the balance between information density and range is optimized to obtain an appropriate delay time.
  • the neural networks 216a, 216b, 216c, ... are each updated in accordance with the changes in the scene.
  • FIG. 6 is a diagram for explaining the manner in which 3D scene information of different ranges is acquired. Assume that a part of the scene of the display world 210 shown in FIG. 5 is displayed as the display image 222.
  • the type-specific 3D scene information acquisition unit 80, for example, separately acquires 3D scene information representing a range 226 in the display world 210 that corresponds to an important area 224 on the display image 222, and 3D scene information representing the entire scene including the outside of that range.
  • the important area 224 here is, for example, an area within a predetermined range from the user's gaze point, an area within a predetermined range from the center of the display image, an area showing the battle situation or acquired items, or an area where main objects such as enemy characters and user characters exist; the rules for selecting the area are set in advance according to the nature of the content.
  • the type-specific 3D scene information acquisition unit 80 may directly identify the main objects themselves that exist in the display world 210, or a range of a predetermined size that includes the objects, and acquire 3D scene information as the learning target.
  • a well-known gaze point detector is provided in the client terminal 10.
  • the input information acquisition unit 70 of the content server 20 then acquires gaze point information from the client terminal 10 at a predetermined rate and determines the learning range individually.
  • the type-specific 3D scene information acquisition unit 80 may separately learn the range of the display world corresponding to the display image 222 being displayed on the client terminal 10 and a wider range including the outside of that range. In any case, the type-specific 3D scene information acquisition unit 80 performs machine learning by extracting corresponding partial areas from the multiple learning images generated by the learning viewpoint generation unit 72, and also performs machine learning of a wider range by using the entire learning images, for example.
  • the number of variations in ranges represented by the 3D scene information is not limited to two, and may be three or more.
  • the inclusion relationship of the ranges is not limited, and 3D scene information may be obtained for each of multiple independent ranges. In any case, 3D scene information is obtained for the number of set ranges.
  • the narrower the range, the shorter the learning time and drawing time, making it possible to display with low latency.
  • By utilizing this characteristic, even if the information density of the 3D scene information corresponding to the important area 224 is increased to a certain extent, drawing can be completed in the same amount of time as drawing a wide-area image from low-information-density 3D scene information, provided the range is narrow. As a result, the important area 224 can be displayed with high definition and low latency.
  • Figure 7 shows a schematic diagram of data transitions during transmission and reception of 3D scene information.
  • the diagram shows the passage of time from top to bottom, with (a) to (c) showing data state transitions in the content server 20 and (c) to (e) showing data state transitions in the client terminal 10.
  • This diagram also shows 3D scene information representing one scene.
  • the type-specific 3D scene information acquisition unit 80 of the content server 20 acquires multiple neural networks 230a, 230b, ... each corresponding to multiple types of 3D scene information, based on a newly generated learning image, as shown in (a).
  • the 3D scene information transmission unit 86 of the content server 20 divides each of the neural networks 230a, 230b, ... as shown in (b). That is, from the neural network 230a it generates a plurality of neural networks 232a and 232b, in each of which some mutually different nodes have been removed. Likewise, from the neural network 230b it generates a plurality of neural networks 234a and 234b in which some mutually different nodes have been removed.
  • In the divided neural networks 232a, 232b, 234a, and 234b, the nodes removed from the original neural networks 230a and 230b are shown with dotted lines. If the nodes removed in one neural network 232a obtained by dividing the neural network 230a are left in the other neural network 232b, then combining the two completely restores the original neural network 230a.
  • the number of divisions of the neural network is not limited to two.
  • the division unit 88 of the 3D scene information transmission unit 86 preferably randomly determines nodes to be excluded using a method similar to dropout.
  • the 3D scene information transmission unit 86 packetizes the divided neural networks 232a, 232b, 234a, and 234b and transmits them to the client terminal 10.
  • the 3D scene information transmission unit 86 randomly rearranges the transmission order of packets from multiple neural networks, as shown in (c).
  • the transmission order is shown as being arranged horizontally, with neural networks 232b, 234b, 232a, ....
  • the rearrangement is performed between neural networks that have been updated in parallel within a certain allowable time. This prevents unnecessary delays in the rendering process due to the rearrangement.
  • When the 3D scene information acquisition unit 52 of the client terminal 10 acquires the packets in sequence, it reconstructs the neural networks before division by returning the extracted neural networks to their original order, as shown in (d). For this purpose, the content server 20 attaches metadata to each of the transmitted neural networks 232b, 234b, 232a, ... indicating from which of the original neural networks 230a, 230b, ... it was divided.
  • In the illustrated example, the 3D scene information acquisition unit 52 has acquired both neural networks 232a and 232b, and is therefore able to completely restore the original neural network 230a.
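The packetization, random reordering within an update window, and metadata-based reconstruction described above can be sketched as follows; the packet format and the `net`/`part` identifiers are hypothetical stand-ins for the metadata the content server attaches:

```python
import random

def packetize(networks):
    """Tag each divided-network part with metadata identifying its origin."""
    packets = []
    for net_id, parts in networks.items():
        for part_id, payload in enumerate(parts):
            packets.append({"net": net_id, "part": part_id, "data": payload})
    return packets

def reorder_within_window(packets, rng):
    """Randomly interleave packets produced in the same update window, so
    a burst of consecutive losses does not wipe out one whole network."""
    shuffled = packets[:]
    rng.shuffle(shuffled)
    return shuffled

def reconstruct(packets):
    """Receiver side: regroup parts by origin network and restore order."""
    nets = {}
    for pkt in packets:
        nets.setdefault(pkt["net"], {})[pkt["part"]] = pkt["data"]
    return {nid: [parts[i] for i in sorted(parts)] for nid, parts in nets.items()}

networks = {"230a": ["a0", "a1"], "230b": ["b0", "b1"]}
stream = reorder_within_window(packetize(networks), random.Random(7))
received = reconstruct(stream)
```

Because reconstruction keys on the attached metadata rather than arrival order, the receiver's result is independent of how the packets were interleaved.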
  • the image generation unit 56 of the client terminal 10 uses the acquired neural networks 232a, 232b, and 234b to draw the display image 238, as shown in (e).
  • the image generation unit 56 draws an image of the scene seen from the latest display viewpoint by volume rendering using the neural networks 232a, 232b, and 234b.
  • the image generation unit 56 updates at least a portion of the display image 238 using the latest 3D scene information. This makes it possible to realize a display in which the image resolution gradually increases and important images move with particularly low latency.
  • Figure 8 is a diagram for explaining the temporal relationship between machine learning in the content server 20 and image drawing in the client terminal 10.
  • the horizontal direction of the figure is the time axis, with the learning time in the content server 20 and the drawing time of each frame of the displayed image in the client terminal 10 each shown as a rectangle.
  • the number attached to each drawing time rectangle indicates the frame order.
  • the type-specific 3D scene information acquisition unit 80 of the content server 20 learns the first, second, ... nth types of 3D scene information individually.
  • the ordinal numbers correspond to the information density, with the first type being the lowest information density and the nth type being the highest information density.
  • When the content server 20 has completed learning the first type of 3D scene information, it transmits it to the client terminal 10, as shown by arrow A1.
  • the client terminal 10 uses the first type of 3D scene information to draw images of frames numbered 0, 1, and 2 at the lowest resolution from time t1. Meanwhile, when the content server 20 has completed learning the second type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow A2. The client terminal 10 begins drawing frames numbered 3 immediately after time t2 when the second type of 3D scene information is acquired, and draws images at a resolution corresponding to the second type of information density. By repeating the same process, the frame resolution gradually increases.
  • Once the content server 20 has completed learning the nth type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow An.
  • the client terminal 10 begins drawing the image at the highest resolution corresponding to the nth type of information density, starting with frame number m+2, which begins drawing immediately after time tn when the 3D scene information is acquired.
  • multiple types of 3D scene information representing the same scene are learned, and the 3D scene information that is completed learning the earliest is immediately transmitted to the client terminal 10. This allows the frame drawing start time to be significantly earlier than when only 3D scene information with a high information density, such as the nth type, is transmitted.
  • drawing can start with a delay of only the time Ta for data transmission and the time Tb required for learning the first type of 3D scene information.
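The timing advantage can be illustrated with assumed learning and transmission times corresponding to Tb and Ta above (all values are purely illustrative):

```python
# Assumed times in milliseconds; type 1 has the lowest information density.
learning_time_ms = {1: 10, 2: 25, 3: 60}
transmit_time_ms = 2  # time Ta for data transmission

# Progressive transmission: drawing starts once the first (fastest-learned)
# type is available, i.e. after time Tb plus transmission time Ta.
progressive_start = min(learning_time_ms.values()) + transmit_time_ms

# Monolithic transmission: drawing waits for the highest-density type.
monolithic_start = max(learning_time_ms.values()) + transmit_time_ms
```

Under these assumptions drawing begins at 12 ms instead of 62 ms; the later, denser types then refine the already-displayed frames rather than gating them.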
  • until 3D scene information for the current scene arrives, drawing of a frame may be performed using the 3D scene information of the previous scene transmitted immediately before.
  • the image generating unit 56 performs volume rendering based on the most recent display viewpoint acquired immediately before. This not only increases the resolution from frame numbers 0 to m+2, but also makes it possible to display images that reflect the movement of the viewpoint with low delay.
  • the lower the information density of the first type or other 3D scene information the fewer the number of samples in volume rendering and the higher the drawing speed, making it possible to display with less delay.
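The relationship between sample count and drawing speed can be seen in a minimal NeRF-style volume rendering quadrature. The density and color functions below stand in for queries to the 3D scene information; all values are illustrative:

```python
import numpy as np

def render_ray(density_fn, color_fn, t_near, t_far, n_samples):
    """Alpha-composite one ray with n_samples quadrature points.

    Fewer samples mean fewer queries to the scene representation, so
    low-information-density 3D scene information can be drawn faster."""
    ts = np.linspace(t_near, t_far, n_samples)
    dt = ts[1] - ts[0]
    sigmas = np.array([density_fn(t) for t in ts])
    colors = np.array([color_fn(t) for t in ts])
    alphas = 1.0 - np.exp(-sigmas * dt)                      # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * trans                                  # compositing weights
    return (weights[:, None] * colors).sum(axis=0)

# A uniform gray medium: 8 samples approximate the 64-sample result closely,
# at one eighth of the query cost.
gray = lambda t: np.array([0.5, 0.5, 0.5])
fine = render_ray(lambda t: 0.5, gray, 0.0, 4.0, 64)
coarse = render_ray(lambda t: 0.5, gray, 0.0, 4.0, 8)
```

The sketch shows why a coarse pass is cheap yet visually plausible: the compositing weights sum toward the same total regardless of how finely the ray is sampled.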
  • The image generation unit 56 can be said to have the function of correcting an image that has already been generated so that it matches the latest display viewpoint, regardless of whether it has already been displayed.
  • Figure 9 is a diagram for explaining the image correction process by image generation unit 56 of client terminal 10.
  • Image 250a is a frame or part of a display image, and shows image 252a of a cylindrical object in the foreground. If the position of the cylindrical object changes relatively due to movement of the viewpoint, and object image 252b shifts in image 250b after the lapse of time Δt, it becomes necessary to newly draw background area 254 that was hidden in image 250a.
  • images 250a and 250c are images that the client terminal 10 has drawn using the latest 3D scene information at that time.
  • When the content server 20 transmits the display image together with the 3D scene information, it is also possible for the client terminal 10 to correct image 250a transmitted from the content server 20 using the 3D scene information.
  • the client terminal 10 generates an image 250c that has been corrected to reflect the movement of the viewpoint that occurs during the time difference Δt between when the image 250a is generated on the content server 20 side and when it is displayed on the client terminal 10.
  • the image generating unit 56 of the client terminal 10 redraws the area 254 that is not shown in the image 250a generated by the content server 20 and the image 252b whose color has changed, using the latest 3D scene information.
  • Since the high-quality image 250a generated by the content server 20 is to be combined with the area newly drawn by the client terminal 10, it is desirable for the image generation unit 56 to draw the necessary area using 3D scene information with high information density. Note that the image generation unit 56 can omit new drawing processing using 3D scene information in a situation where there is little sense of incongruity simply by moving or deforming the image in the image 250a transmitted from the content server 20. For example, for distant objects where the amount of image shift or color change is small in response to changes in the display viewpoint, the image generation unit 56 may perform correction by directly processing the image 250a.
  • the image generation unit 56 stores in advance in an internal memory or the like rules for determining areas that need to be redrawn using 3D scene information. For example, when the speed of the display viewpoint exceeds a threshold value, or is predicted to exceed it, the image generation unit 56 may use 3D scene information to redraw objects whose distance from the viewpoint is less than or equal to the threshold value, and their surrounding areas. In this case, the content server 20 also transmits geometry information of the display world to the client terminal 10. This allows the image generation unit 56 to obtain changes in the distance between the display viewpoint and the object.
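The redraw-decision rule described above can be sketched as follows; the threshold values and parameter names are assumptions for illustration, not values from the patent:

```python
def needs_redraw(viewpoint_speed, object_distance,
                 speed_threshold=1.0, distance_threshold=5.0):
    """Redraw an object and its surroundings from 3D scene information when
    the display viewpoint moves fast and the object is near the viewpoint;
    otherwise a 2D shift or deformation of the received image suffices."""
    return viewpoint_speed > speed_threshold and object_distance <= distance_threshold

# Fast viewpoint motion near an object forces a full redraw; distant objects
# or slow motion can be handled by warping the existing image.
```

The distance test is what requires the geometry information the content server additionally transmits, since the client must relate each object's position to the moving display viewpoint.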
  • the content server 20 generates learning images that represent the display world of the content, and acquires multiple types of 3D scene information for each scene.
  • the content server 20 sequentially transmits the 3D scene information for which learning has been completed to the client terminal 10, and the client terminal 10 uses it to draw frames of the display image corresponding to the latest display viewpoint.
  • By preparing various types of 3D scene information that differ in, for example, information density and the width of the range to be represented, the learning time can be varied, and the delay time until display can also be controlled.
  • the level of detail can be increased in some areas, making it possible to display high-quality images with low latency that take into account the importance of the image, etc.
  • the content server 20 also divides the neural network that constitutes one piece of 3D scene information into multiple neural networks and transmits them to the client terminal 10. This makes it possible to improve robustness against packet loss while suppressing increases in data size. As a result, in image processing involving distribution from the content server 20, the impact of distribution can be reduced, improving the quality of the user experience.
  • the present invention can be used in various information processing devices such as content servers, game devices, head-mounted displays, display devices, mobile terminals, and personal computers, as well as image display systems that include any of these.
  • [Item 1] A content server comprising circuitry, the circuitry configured to: generate learning images that represent scenes of a three-dimensional display world, in which the situation changes in response to user operations, from multiple viewpoints; acquire a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data; and transmit the data of the plurality of types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that draws a display image using the 3D scene information.
  • [Item 2] The content server according to item 1, wherein the circuitry acquires a plurality of pieces of 3D scene information, each having a different spatial information density.
  • The content server according to item 1, wherein the circuitry divides a neural network constituting one piece of the 3D scene information into multiple neural networks, from each of which some mutually different nodes are excluded, and transmits the multiple neural networks to the client terminal.
  • the circuitry packetizes the neural network after division and randomly changes the transmission order of the packets.
  • the circuitry obtains a plurality of pieces of 3D scene information representing different areas in the display world.
  • the circuitry obtains the 3D scene information representing a range of the display world that corresponds to a defined area in an image displayed on the client terminal.
  • [Item 9] A client terminal comprising circuitry, the circuitry configured to: acquire information on a user operation and information on a display viewpoint for a three-dimensional display world; acquire from a server data of a plurality of types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world, whose situation changes in response to the user operation; and render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.
  • [Item 10] The client terminal according to item 9, wherein the circuitry obtains a plurality of neural networks, obtained by dividing a neural network that constitutes one piece of the 3D scene information such that some mutually different nodes are excluded from each, and reconstructs the neural network before division.
  • An image display system comprising: a client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation; and a content server that transmits data used to generate the display image, the client terminal and the content server each having circuitry, wherein:
  • the circuitry of the content server generates images representing each scene of the display world as seen from a plurality of viewpoints as learning images, obtains a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data, and transmits the data of the plurality of types of 3D scene information to the client terminal in the order in which machine learning for each scene is completed; and
  • the circuitry of the client terminal acquires information on the user operation and information on a display viewpoint for the display world, acquires the data of the plurality of types of 3D scene information from the content server, and renders at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.
  • A display data transmission method comprising: generating learning images that represent scenes of a three-dimensional display world, in which the situation changes in response to user operations, from multiple viewpoints; obtaining a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data; and transmitting the data of the plurality of types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that draws a display image using the 3D scene information.
  • A function of acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world; a function of acquiring from a server data of a plurality of types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world, whose situation changes in response to the user operation; and a function of rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This content server 20 acquires neural networks 230a, 230b, ... representing a plurality of types of 3D scene information by generating learning images representing each scene in a display world and performing machine learning. The content server 20 divides the neural networks 230a, 230b, ... to generate a plurality of neural networks 232a, 232b, etc., randomly switches the order of packets, and transmits the packets to a client terminal 10. The client terminal 10 draws a display image 238 corresponding to the most recent display viewpoint by, for example, returning the divided neural networks 232a, 232b, etc. to the original neural network 230a.

Description

Content server, client terminal, image display system, display data transmission method, and display image generation method

 This invention relates to a content server, a client terminal, an image display system, a display data transmission method, and a display image generation method that display images of a three-dimensional display world.

 With the recent expansion of communication networks and developments in image processing technology, it has become possible to enjoy a wide variety of electronic content regardless of the viewing environment. For example, in the field of electronic games, a system has become widespread in which a server collects operation information entered into each client terminal and distributes game images that reflect this information as needed, allowing multiple players to participate in the same game regardless of location.

 Not limited to electronic games, electronic content in which video images generated in real time in response to user operations are distributed from a server can utilize the server's abundant processing environment, making it easier to display high-quality images while minimizing the impact of the client terminal's processing performance. On the other hand, because the transmission of operation information from the client terminal and the data transmission from the server that receives it always intervene, the display may fail to keep up with viewpoint operations, or images may be lost due to packet loss, resulting in a deterioration in the quality of the user experience.

 The present invention was made in consideration of these problems, and its purpose is to provide a technology that reduces the impact of distribution on the quality of the user experience when processing images of content that is distributed from a server.

 In order to solve the above problem, one aspect of the present invention relates to a content server. This content server is characterized by having a learning image generation unit that generates learning images that represent the state of each scene in a three-dimensional display world, where the situation changes in response to user operations, as viewed from multiple viewpoints; a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and a 3D scene information transmission unit that transmits data on the multiple types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that uses the 3D scene information to draw a display image.

 本発明の別の態様はクライアント端末に関する。このクライアント端末は、ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する入力情報取得部と、ユーザ操作に応じて状況が変化する表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する3Dシーン情報取得部と、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、を備えたことを特徴とする。 Another aspect of the present invention relates to a client terminal. This client terminal is characterized by having an input information acquisition unit that acquires information on user operations and information on a display viewpoint for a three-dimensional display world, a 3D scene information acquisition unit that acquires from a server data on multiple types of 3D scene information that represent three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to user operations, and an image generation unit that draws at least a portion of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 本発明のさらに別の態様は画像表示システムに関する。この画像表示システムは、ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、コンテンツサーバは、表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、クライアント端末に、複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、を備え、クライアント端末は、ユーザ操作の情報と、表示世界に対する表示用視点の情報とを取得する入力情報取得部と、複数種類の3Dシーン情報のデータを、コンテンツサーバから取得する3Dシーン情報取得部と、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、を備えたことを特徴とする。 Another aspect of the present invention relates to an image display system. This image display system includes a client terminal that displays an image of a three-dimensional display world in which the situation changes according to user operations, and a content server that transmits data used to generate the display image. The content server includes a learning image generation unit that generates images showing how each scene in the display world looks from multiple viewpoints as learning images, a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene by machine learning using the learning images as teacher data, and a 3D scene information transmission unit that transmits the multiple types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed. The client terminal includes an input information acquisition unit that acquires information on user operations and information on a display viewpoint for the display world, a 3D scene information acquisition unit that acquires data on the multiple types of 3D scene information from the content server, and an image generation unit that draws at least a part of the frame of the display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 本発明のさらに別の態様は表示用データ送信方法に関する。この表示用データ送信方法は、ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成するステップと、学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得するステップと、3Dシーン情報を用いて表示画像を描画するクライアント端末に、複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信するステップと、を含むことを特徴とする。 Another aspect of the present invention relates to a display data transmission method. This display data transmission method is characterized by including the steps of: generating, as learning images, images that represent how each scene in a three-dimensional display world, in which the situation changes in response to user operations, is viewed from multiple viewpoints; acquiring multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and transmitting data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.

 本発明のさらに別の態様は表示画像生成方法に関する。この表示画像生成方法は、ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得するステップと、ユーザ操作に応じて状況が変化する表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得するステップと、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画するステップと、を含むことを特徴とする。 Another aspect of the present invention relates to a display image generating method, which is characterized by including the steps of: acquiring information on user operations and information on a display viewpoint for a three-dimensional display world; acquiring from a server data on multiple types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to the user operations; and drawing at least a part of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 なお、以上の構成要素の任意の組合せ、本発明の表現を方法、装置、システム、コンピュータプログラム、データ構造、記録媒体などの間で変換したものもまた、本発明の態様として有効である。 In addition, any combination of the above components, and any conversion of the present invention into a method, device, system, computer program, data structure, recording medium, etc., are also valid aspects of the present invention.

 本発明によれば、サーバからの配信を伴うコンテンツの画像処理において、配信によるユーザ体験の質への影響を軽減できる。 According to the present invention, when processing images of content that is distributed from a server, the impact of distribution on the quality of the user experience can be reduced.

本実施の形態を適用できる画像表示システムの構成例を示す図である。 FIG. 1 is a diagram showing a configuration example of an image display system to which the present embodiment can be applied.
本実施の形態のクライアント端末の内部回路構成を示す図である。 FIG. 2 is a diagram illustrating an internal circuit configuration of a client terminal according to the present embodiment.
本実施の形態の画像処理の基本的な流れを、従来技術と比較して示す図である。 FIG. 3 is a diagram showing the basic flow of image processing according to the present embodiment in comparison with the prior art.
本実施の形態におけるクライアント端末およびコンテンツサーバの機能ブロックの構成を示す図である。 FIG. 4 is a diagram showing the configuration of functional blocks of a client terminal and a content server according to the present embodiment.
本実施の形態においてコンテンツサーバが3Dシーン情報を取得する手順を模式的に示す図である。 FIG. 5 is a diagram schematically showing a procedure by which the content server acquires 3D scene information in the present embodiment.
本実施の形態において、異なる範囲の3Dシーン情報を取得する態様を説明するための図である。 FIG. 6 is a diagram for explaining how 3D scene information of different ranges is acquired in the present embodiment.
本実施の形態において、3Dシーン情報の送受信におけるデータの変遷を模式的に示す図である。 FIG. 7 is a diagram schematically showing data transitions during transmission and reception of 3D scene information in the present embodiment.
本実施の形態における、コンテンツサーバにおける機械学習と、クライアント端末における画像描画の時間的関係を説明するための図である。 FIG. 8 is a diagram for explaining the temporal relationship between machine learning on the content server and image drawing on the client terminal in the present embodiment.
本実施の形態における、クライアント端末の画像生成部による画像補正処理を説明するための図である。 FIG. 9 is a diagram for explaining image correction processing performed by the image generation unit of the client terminal in the present embodiment.

 図1は本実施の形態を適用できる画像表示システムの構成例を示す。画像表示システム1は、ユーザ操作に応じて画像を表示させるクライアント端末10a、10b、10cおよび、表示に用いるデータを提供するコンテンツサーバ20を含む。クライアント端末10a、10b、10cにはそれぞれ、ユーザ操作のための入力装置14a、14b、14cと、画像を表示する表示装置16a、16b、16cが接続される。クライアント端末10a、10b、10cとコンテンツサーバ20は、WAN(Wide Area Network)やLAN(Local Area Network)などのネットワーク8を介して通信を確立できる。 FIG. 1 shows an example of the configuration of an image display system to which this embodiment can be applied. The image display system 1 includes client terminals 10a, 10b, 10c that display images in response to user operations, and a content server 20 that provides data used for display. Input devices 14a, 14b, 14c for user operations and display devices 16a, 16b, 16c that display images are connected to the client terminals 10a, 10b, 10c, respectively. The client terminals 10a, 10b, 10c and the content server 20 can establish communication via a network 8 such as a WAN (Wide Area Network) or a LAN (Local Area Network).

 クライアント端末10a、10b、10cと、表示装置16a、16b、16cおよび入力装置14a、14b、14cはそれぞれ、有線または無線のどちらで接続されてもよい。あるいはそれらの装置の2つ以上が一体的に形成されていてもよい。例えば図においてクライアント端末10bは、表示装置16bであるヘッドマウントディスプレイに接続している。ヘッドマウントディスプレイは、それを頭部に装着したユーザの動きによって表示画像の視野を変更できるため、入力装置14bとしても機能する。 The client terminals 10a, 10b, 10c may be connected to the display devices 16a, 16b, 16c and the input devices 14a, 14b, 14c either wired or wirelessly. Alternatively, two or more of these devices may be formed integrally. For example, in the figure, the client terminal 10b is connected to a head-mounted display, which is the display device 16b. The head-mounted display can change the field of view of the displayed image according to the movement of the user wearing it on the head, so it also functions as the input device 14b.

 またクライアント端末10cは携帯端末であり、表示装置16cと、その画面を覆うタッチパッドである入力装置14cと一体的に構成されている。このように、図示する装置の外観形状や接続形態は限定されない。ネットワーク8に接続するクライアント端末10a、10b、10cやコンテンツサーバ20の数も限定されない。以後、クライアント端末10a、10b、10cをクライアント端末10、入力装置14a、14b、14cを入力装置14、表示装置16a、16b、16cを表示装置16と総称する。 The client terminal 10c is a portable terminal that is integrated with the display device 16c and the input device 14c, which is a touchpad that covers the screen of the display device 16c. In this way, there are no limitations on the external shape or connection form of the illustrated devices. There are also no limitations on the number of client terminals 10a, 10b, 10c and content servers 20 that are connected to the network 8. Hereinafter, the client terminals 10a, 10b, 10c will be collectively referred to as client terminals 10, the input devices 14a, 14b, 14c as input device 14, and the display devices 16a, 16b, 16c as display device 16.

 入力装置14は、コントローラ、キーボード、マウス、タッチパッド、ジョイスティックなど一般的な入力装置や、ヘッドマウントディスプレイが備えるモーションセンサ、カメラなどの各種センサのいずれか、または組み合わせでよく、クライアント端末10へユーザ操作の内容を供給する。表示装置16は、液晶ディスプレイ、プラズマディスプレイ、有機ELディスプレイ、ウェアラブルディスプレイ、プロジェクタなど一般的なディスプレイでよく、クライアント端末10から出力される画像を表示する。 The input device 14 may be any one or a combination of general input devices such as a controller, keyboard, mouse, touchpad, joystick, or various sensors such as a motion sensor or camera equipped in a head mounted display, and supplies the contents of user operations to the client terminal 10. The display device 16 may be any general display such as a liquid crystal display, plasma display, organic EL display, wearable display, or projector, and displays images output from the client terminal 10.

 コンテンツサーバ20は、画像表示を伴うコンテンツのデータをクライアント端末10に提供する。当該コンテンツの種類は特に限定されず、電子ゲーム、観賞用画像、ウェブページ、アバターによるビデオチャットなどのいずれでもよい。本実施の形態においてコンテンツサーバ20は、入力装置14に対するユーザ操作の情報を逐次、クライアント端末10から取得し、表示対象の世界に反映させたうえ、それを表す画像がクライアント端末10側で表示されるように、必要なデータを送信する。 The content server 20 provides data of content accompanied by image display to the client terminal 10. The type of content is not particularly limited, and may be any of electronic games, decorative images, web pages, video chat using avatars, etc. In this embodiment, the content server 20 sequentially obtains information on user operations on the input device 14 from the client terminal 10, reflects this information in the world to be displayed, and transmits the necessary data so that an image representing this is displayed on the client terminal 10 side.

 図2はクライアント端末10の内部回路構成を示している。クライアント端末10は、CPU(Central Processing Unit)122、GPU(Graphics Processing Unit)124、メインメモリ126を含む。これらの各部は、バス130を介して相互に接続されている。バス130にはさらに入出力インターフェース128が接続されている。入出力インターフェース128には、USBなどの周辺機器インターフェースや、有線又は無線LANのネットワークインターフェースからなる通信部132、ハードディスクドライブや不揮発性メモリなどの記憶部134、表示装置16へデータを出力する出力部136、入力装置14からデータを入力する入力部138、磁気ディスク、光ディスクまたは半導体メモリなどのリムーバブル記録媒体を駆動する記録媒体駆動部140が接続される。 Figure 2 shows the internal circuit configuration of the client terminal 10. The client terminal 10 includes a CPU (Central Processing Unit) 122, a GPU (Graphics Processing Unit) 124, and a main memory 126. These components are interconnected via a bus 130. An input/output interface 128 is also connected to the bus 130. To the input/output interface 128, there are connected a communication unit 132 consisting of a peripheral device interface such as a USB or a network interface for a wired or wireless LAN, a storage unit 134 such as a hard disk drive or non-volatile memory, an output unit 136 that outputs data to the display device 16, an input unit 138 that inputs data from the input device 14, and a recording medium drive unit 140 that drives a removable recording medium such as a magnetic disk, optical disk, or semiconductor memory.

 CPU122は、記憶部134に記憶されているオペレーティングシステムを実行することにより、クライアント端末10の全体を制御する。CPU122はまた、リムーバブル記録媒体から読み出されてメインメモリ126にロードされた、あるいは通信部132を介してダウンロードされた各種プログラムを実行する。GPU124は、CPU122からの描画命令に従って描画処理を行い、表示画像を図示しないフレームバッファに格納する。そしてフレームバッファに格納された表示画像をビデオ信号に変換して出力部136に出力する。メインメモリ126はRAM(Random Access Memory)により構成され、処理に必要なプログラムやデータを記憶する。コンテンツサーバ20も同様の内部回路構成を有してよい。 The CPU 122 executes an operating system stored in the storage unit 134 to control the entire client terminal 10. The CPU 122 also executes various programs that have been read from a removable recording medium and loaded into the main memory 126, or downloaded via the communication unit 132. The GPU 124 performs drawing processing in accordance with drawing commands from the CPU 122, and stores the display image in a frame buffer (not shown). The GPU 124 then converts the display image stored in the frame buffer into a video signal and outputs it to the output unit 136. The main memory 126 is composed of RAM (Random Access Memory), and stores programs and data necessary for processing. The content server 20 may also have a similar internal circuit configuration.

 図3は、本実施の形態の画像処理の基本的な流れを、従来技術と比較して示している。本実施の形態では、様々なオブジェクトが存在する3次元空間の世界を主たる表示対象とする。当該世界の状況は、プログラム等の規定やユーザ操作に応じて変化する。(a)に示す一般的な処理の場合、コンテンツサーバは、ユーザ操作の内容や、表示世界に対する視点の位置、視線の方向の情報を随時取得する。以後、表示対象の3次元空間全体を「表示世界」、表示視野内またはその近傍の表示世界の状態を「シーン」と呼ぶ。また、シーンに対する視点の位置および視線の方向を、単に「視点」と総称する場合がある。視点はユーザが、入力装置14を介して手動で操作してもよいし、ヘッドマウントディスプレイが備えるモーションセンサなどによって、ユーザ頭部の動きから導出してもよい。 Figure 3 shows the basic flow of image processing in this embodiment in comparison with the prior art. In this embodiment, the main display target is a three-dimensional world in which various objects exist. The state of the world changes according to the provisions of the program or the user's operation. In the case of the general processing shown in (a), the content server constantly acquires information on the content of the user's operation, the position of the viewpoint relative to the displayed world, and the direction of the line of sight. Hereinafter, the entire three-dimensional space to be displayed is called the "display world", and the state of the displayed world within or near the display field of view is called the "scene". The position of the viewpoint and the direction of the line of sight relative to the scene may also be collectively referred to simply as the "viewpoint". The viewpoint may be manually operated by the user via the input device 14, or may be derived from the movement of the user's head using a motion sensor provided in the head-mounted display.

 コンテンツサーバは、ユーザ操作に対応するようにシーンを変化させながら、視点情報に対応する視野で表示画像200を描画する。コンテンツサーバは例えば、レイトレーシングやラスタライズなど周知のコンピュータグラフィクス描画技術により、表示画像200を生成する。コンテンツサーバは生成した表示画像200をクライアント端末へ送信し、クライアント端末はそれを表示装置に表示させる。図示する処理を所定のフレームレートで繰り返すことにより、クライアント端末側では、ユーザ操作等に応じたシーンの変化を表す動画像が表示される。 The content server draws the display image 200 in a field of view corresponding to the viewpoint information while changing the scene in response to user operations. The content server generates the display image 200 using well-known computer graphics drawing techniques such as ray tracing and rasterization. The content server transmits the generated display image 200 to the client terminal, which displays it on a display device. By repeating the illustrated process at a specified frame rate, a moving image showing the change in the scene in response to user operations, etc. is displayed on the client terminal side.

 (b)が示す本実施の形態においても、コンテンツサーバ20は、ユーザ操作の内容や、表示世界に対する視点の位置、視線の方向の情報を随時取得し、それに対応するようにシーンを変化させながら、画像を同様に描画する。一方、本実施の形態においてコンテンツサーバ20は、当該画像を学習用画像202とし、機械学習の教師データに用いる。コンテンツサーバ20は学習用画像202を収集して機械学習を行うことにより、シーンの3次元情報を表す3Dシーン情報204を生成する。 In this embodiment shown in (b), the content server 20 also acquires information on the content of user operations, the position of the viewpoint relative to the displayed world, and the direction of the gaze as needed, and similarly renders images while changing the scene accordingly. Meanwhile, in this embodiment, the content server 20 uses the images as training images 202 and as training data for machine learning. The content server 20 collects the training images 202 and performs machine learning to generate 3D scene information 204 that represents three-dimensional information about the scene.

 機械学習により3次元空間の情報を獲得する手法としてNeRF(Neural Radiance Fields)がある(例えば、Ben Mildenhall、外5名、「NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis」、Communications of the ACM、2022年1月、第65巻、第1号、p.99-106参照)。本実施の形態においてNeRFを導入する場合、まず、学習用画像202を生成する際に定めた、それぞれの視点情報、すなわち仮想的な視点の位置と視線の方向を入力とし、対応する学習用画像202を教師データとして、多層パーセプトロン(MLP:Multilayer perceptron)を用いた回帰により、シーンの3次元情報を表すデータを得る。 NeRF (Neural Radiance Fields) is a method for acquiring information about three-dimensional space through machine learning (see, for example, Ben Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," Communications of the ACM, January 2022, Vol. 65, No. 1, pp. 99-106). When NeRF is introduced in this embodiment, first, the respective viewpoint information determined when generating the training images 202, i.e., the virtual viewpoint position and line of sight direction, are input, and the corresponding training images 202 are used as teacher data to obtain data representing the three-dimensional information of the scene by regression using a multilayer perceptron (MLP).

 このデータは、3次元空間における位置座標(x,y,z)と方向ベクトルd(θ,φ)からなる5次元のパラメータを入力とし、体積密度σと3原色の色情報c(RGB)を出力とするニューラルネットワークである。本実施の形態では当該ニューラルネットワークのデータを「3Dシーン情報」と呼んでいる。なおNeRFについては様々な改良手法が提案されており、本実施の形態においても具体的な手法は特に限定されない。また機械学習の手法をNeRFに限定する主旨ではない。 This data is a neural network that takes five-dimensional parameters consisting of position coordinates (x, y, z) and directional vector d (θ, φ) in three-dimensional space as input, and outputs volume density σ and color information c (RGB) of the three primary colors. In this embodiment, the data of this neural network is called "3D scene information." Note that various improvement methods have been proposed for NeRF, and the specific method is not particularly limited in this embodiment. Furthermore, it is not intended to limit the machine learning method to NeRF.
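The 5-dimensional-input, 4-dimensional-output mapping described above can be sketched as a small fully connected network. The following NumPy sketch is only an illustration of the idea, not the implementation of this embodiment; the layer sizes, activations, and function names are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Randomly initialize weights and biases for a small fully connected network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# 5-D input (x, y, z, theta, phi) -> 4-D output (sigma, R, G, B)
params = init_mlp([5, 64, 64, 4])

def radiance_field(pos, view_dir):
    """Query volume density sigma and color c at a 3-D point seen from a direction."""
    out = mlp_forward(params, np.concatenate([pos, view_dir]))
    sigma = np.log1p(np.exp(out[0]))          # softplus keeps density non-negative
    color = 1.0 / (1.0 + np.exp(-out[1:]))    # sigmoid keeps RGB in [0, 1]
    return sigma, color

sigma, color = radiance_field(np.array([0.1, 0.2, 0.3]), np.array([0.0, 1.0]))
print(sigma >= 0.0, color.shape)
```

In an actual NeRF pipeline the network weights would be fitted by regression against the learning images; here they are random, so only the input/output structure is meaningful.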

 本実施の形態において学習用画像202が表す内容、ひいては3Dシーン情報204は時々刻々と変化し得る。図ではある一時刻、または一時刻と見なせる微小時間におけるシーンの3Dシーン情報204が生成される状況を表している。以後、一時刻、または一時刻と見なせる微小時間におけるシーンを「1つのシーン」と表現する場合がある。なお表示世界に動きがない場合は、時間によらず「1つのシーン」として扱うことができる。また、前のシーンに対し3Dシーン情報204が得られているとき、コンテンツサーバ20は次のシーンに対応するように3Dシーン情報204を更新する。以後、3Dシーン情報の生成および更新を、3Dシーン情報の「取得」と総称する場合がある。 In this embodiment, the content represented by the learning image 202, and therefore the 3D scene information 204, can change from moment to moment. The figure shows the situation in which 3D scene information 204 of a scene at a certain time, or at an infinitesimal time that can be considered as a certain time, is generated. Hereinafter, a scene at a certain time, or at an infinitesimal time that can be considered as a certain time, may be expressed as "one scene". Note that if there is no movement in the displayed world, it can be treated as "one scene" regardless of time. Also, when 3D scene information 204 has been obtained for the previous scene, the content server 20 updates the 3D scene information 204 to correspond to the next scene. Hereinafter, the generation and updating of 3D scene information may be collectively referred to as "obtaining" 3D scene information.

 精度のよい3Dシーン情報204を取得するには、コンテンツサーバ20は1つのシーンの学習用画像202を、なるべく多くの視点から収集することが望ましい。そのためコンテンツサーバ20は、例えば次のような方法で学習用画像202を収集してもよい。 In order to obtain accurate 3D scene information 204, it is desirable for the content server 20 to collect learning images 202 of one scene from as many viewpoints as possible. For this reason, the content server 20 may collect the learning images 202, for example, in the following manner.
(1)実際に表示される画像の視野を規定する視点の周囲に、学習に適した視点を自ら生成し、対応する画像を生成する (1) Generate viewpoints suitable for learning around the viewpoint that defines the field of view of the image that is actually displayed, and generate the corresponding images.
(2)同じシーンを見ている複数のユーザの端末において表示される画像を流用する (2) Reuse images displayed on the terminals of multiple users viewing the same scene.

 以後、クライアント端末10における実際の表示視野を規定する視点を「表示用視点」と呼び、学習用画像を生成する際に設定する「学習用視点」と区別する。コンテンツサーバ20は(1)と(2)のどちらか一方のみを実施してもよいし、双方を実施してもよい。例えば(2)によって足りない視点を、(1)によって補ってもよい。コンテンツサーバ20は、学習結果である3Dシーン情報204をクライアント端末10へ送信する。クライアント端末10は、送信された3Dシーン情報204を用いて表示画像206を生成する。 Hereinafter, the viewpoint that defines the actual display field of view in the client terminal 10 is called the "display viewpoint" to be distinguished from the "learning viewpoint" that is set when generating a learning image. The content server 20 may implement only one of (1) and (2), or may implement both. For example, a viewpoint that is missing due to (2) may be supplemented by (1). The content server 20 transmits 3D scene information 204, which is the result of the learning, to the client terminal 10. The client terminal 10 generates a display image 206 using the transmitted 3D scene information 204.

 3Dシーン情報204を用いることにより、クライアント端末10は比較的低い負荷で、シーンを任意の視点から見た様子を高品質に表すことができる。NeRFを適用する場合、クライアント端末10は、表示用視点からビュースクリーンの各画素を通る光線(レイ)rを発生させ、その方向に沿って色を積分していくボリュームレンダリングにより、表示画像の画素値C(r)を次のように求める。 By using the 3D scene information 204, the client terminal 10 can display the scene as it is viewed from any viewpoint with high quality, with a relatively low load. When NeRF is applied, the client terminal 10 generates a ray r that passes through each pixel on the view screen from the display viewpoint, and uses volume rendering to integrate the color along that direction to determine the pixel value C(r) of the display image as follows:

C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt,  where r(t) = o + t d    (Equation 1)

 ここでt_n、t_fはそれぞれ、レイrの近位と遠位、T(t)はレイの方向における累積透過率であり、次のように表される。 Here, t_n and t_f are the near and far bounds of the ray r, respectively, and T(t) is the cumulative transmittance in the direction of the ray, expressed as follows.

T(t) = exp( −∫_{t_n}^{t} σ(r(s)) ds )    (Equation 2)

 コンテンツサーバ20は、シーンの動きに対し図示する処理を繰り返すことにより、3Dシーン情報204を所定のレートで更新していき、順次、クライアント端末10に送信する。クライアント端末10は、用いる3Dシーン情報を更新しながら表示画像206を生成することにより、コンテンツサーバ20が生成する学習用画像202と同等の変化を有する動画像を、任意の視点から表現できる。例えばクライアント端末10が、表示直前の視点からシーンを見た様子を表す表示画像206を、最新の3Dシーン情報204を用いて描画することにより、視点の変化に対し遅延の少ない画像を表示できる。 The content server 20 repeats the process shown in the figure for the movement of the scene, updating the 3D scene information 204 at a predetermined rate and sequentially transmitting it to the client terminal 10. The client terminal 10 generates a display image 206 while updating the 3D scene information it uses, thereby making it possible to represent moving images from any viewpoint that have the same changes as the learning images 202 generated by the content server 20. For example, the client terminal 10 can display an image with little delay in response to changes in viewpoint by drawing a display image 206 that shows the scene as seen from the viewpoint immediately before display using the latest 3D scene information 204.
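In practice, the per-ray color accumulation used in this kind of volume rendering is computed as a discrete quadrature over samples taken between the near and far bounds of the ray. The following NumPy sketch is an illustrative assumption, not part of the disclosed embodiment; `render_ray` is a hypothetical name.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray.

    sigmas: (N,) volume densities at N samples between t_n and t_f
    colors: (N, 3) RGB colors at the samples
    deltas: (N,) spacing between consecutive samples
    Returns the accumulated pixel color C(r).
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alphas)               # cumulative transmittance
    trans = np.concatenate([[1.0], trans[:-1]])    # shift: T_i depends on samples j < i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# Example: a fully opaque red sample hides the green sample behind it.
sigmas = np.array([1e9, 1e9])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.array([0.1, 0.1])
print(render_ray(sigmas, colors, deltas))  # ≈ [1, 0, 0]
```

The client would evaluate the neural network at each sample to obtain `sigmas` and `colors`, then run this accumulation once per pixel of the view screen.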

 図4は、本実施の形態におけるクライアント端末10およびコンテンツサーバ20の機能ブロックの構成を示している。図示する機能ブロックにおける構成要素の機能は、本明細書にて記載された機能を実現するように構成され又はプログラムされた、汎用プロセッサ、特定用途プロセッサ、集積回路、ASICs (Application Specific Integrated Circuits)、CPU (a Central Processing Unit)、従来型の回路、および/又はそれらの組合せを含む、回路(circuitry)又は処理回路(processing circuitry)において実現されてよい。プロセッサは、トランジスタやその他の回路を含む回路(circuitry)又は処理回路(processing circuitry)とみなされる。プロセッサは、メモリに格納されたプログラムを実行する、プログラムされたプロセッサ(programmed processor)であってもよい。 FIG. 4 shows the functional block configuration of the client terminal 10 and the content server 20 in this embodiment. The functions of the components in the illustrated functional blocks may be realized in circuitry or processing circuitry including general purpose processors, application specific processors, integrated circuits, ASICs (Application Specific Integrated Circuits), a CPU (a Central Processing Unit), conventional circuits, and/or combinations thereof, configured or programmed to realize the functions described herein. A processor is considered to be a circuitry or processing circuitry including transistors and other circuits. A processor may be a programmed processor that executes programs stored in memory.

 本明細書において、回路(circuitry)、ユニット、手段は、記載された機能を実現するようにプログラムされたハードウェア、又は実行するハードウェアである。当該ハードウェアは、本明細書に開示されているあらゆるハードウェア、又は、当該記載された機能を実現するようにプログラムされた、又は、実行するものとして知られているあらゆるハードウェアであってもよい。当該ハードウェアが回路(circuitry)のタイプであるとみなされるプロセッサである場合、当該回路(circuitry)、手段、又はユニットは、ハードウェアと、当該ハードウェア及び又はプロセッサを構成する為に用いられるソフトウェアとの組合せである。 In this specification, a circuit, unit, or means is hardware that is programmed to realize or executes a described function. The hardware may be any hardware disclosed in this specification or any hardware that is programmed to realize or known to execute the described function. If the hardware is a processor, which is considered to be a type of circuit, the circuit, means, or unit is a combination of hardware and software used to configure the hardware and/or processor.

 クライアント端末10は、ユーザ操作などの入力情報を取得する入力情報取得部50、コンテンツサーバ20から3Dシーン情報のデータを取得する3Dシーン情報取得部52、取得した3Dシーン情報のデータを格納する3Dシーン情報記憶部54、表示画像を生成する画像生成部56、および、表示画像のデータを出力する出力部58を備える。入力情報取得部50は、ユーザ操作の内容を入力装置14から随時取得する。ユーザ操作には、コンテンツの選択や起動、実施中のコンテンツに対するコマンド入力などが含まれる。 The client terminal 10 includes an input information acquisition unit 50 that acquires input information such as user operations, a 3D scene information acquisition unit 52 that acquires 3D scene information data from the content server 20, a 3D scene information storage unit 54 that stores the acquired 3D scene information data, an image generation unit 56 that generates a display image, and an output unit 58 that outputs the display image data. The input information acquisition unit 50 acquires the contents of user operations from the input device 14 at any time. User operations include the selection and activation of content, and command input for content currently being executed.

 入力情報取得部50はまた、表示世界に対する表示用視点の情報を随時、あるいは所定の時間間隔で、入力装置14やヘッドマウントディスプレイから取得する。ヘッドマウントディスプレイを装着したユーザの頭部の位置や姿勢を検出し、それに基づき視点の情報を取得する技術は周知であり、本実施の形態においてもそれを適用してよい。入力情報取得部50は、取得した情報をコンテンツサーバ20および画像生成部56に適宜供給する。3Dシーン情報取得部52は、継続的に更新される3Dシーン情報のデータを、コンテンツサーバ20から順次取得する。 The input information acquisition unit 50 also acquires information on the display viewpoint for the displayed world from the input device 14 or head-mounted display at any time or at a specified time interval. Technology for detecting the position and posture of the head of a user wearing a head-mounted display and acquiring viewpoint information based on this is well known, and this may also be applied in this embodiment. The input information acquisition unit 50 supplies the acquired information to the content server 20 and the image generation unit 56 as appropriate. The 3D scene information acquisition unit 52 sequentially acquires data of 3D scene information, which is continuously updated, from the content server 20.

 後述するように3Dシーン情報取得部52は、1つのシーンを表す複数種類の3Dシーン情報をコンテンツサーバ20から取得する。3Dシーン情報記憶部54は、3Dシーン情報取得部52が取得した複数種類の3Dシーン情報のデータを格納する。3Dシーン情報取得部52は、新たな3Dシーン情報を取得すると、3Dシーン情報記憶部54に格納された、同じ種類の3Dシーン情報のデータを更新する。 As described below, the 3D scene information acquisition unit 52 acquires multiple types of 3D scene information representing one scene from the content server 20. The 3D scene information storage unit 54 stores the data of the multiple types of 3D scene information acquired by the 3D scene information acquisition unit 52. When the 3D scene information acquisition unit 52 acquires new 3D scene information, it updates the data of the same type of 3D scene information stored in the 3D scene information storage unit 54.

 画像生成部56は、3Dシーン情報記憶部54に直近に格納された3Dシーン情報のデータを用いて、所定のフレームレートで表示画像を描画する。ここで画像生成部56は、最新の表示用視点を入力情報取得部50から取得し、上述したボリュームレンダリングなどの手法により対応する視野で画像を描画する。機械学習を用いて、シーンの変遷に対応した3Dシーン情報を準備できれば、クライアント端末10が表示画像を生成しても、通常のレイトレーシングなどの処理と比較し軽い負荷で、高品質な画像を描画できる。 The image generation unit 56 uses the 3D scene information data most recently stored in the 3D scene information storage unit 54 to draw a display image at a predetermined frame rate. Here, the image generation unit 56 acquires the latest display viewpoint from the input information acquisition unit 50, and draws an image in the corresponding field of view using a technique such as the volume rendering described above. If 3D scene information corresponding to the transition of the scene can be prepared using machine learning, then even if the client terminal 10 generates the display image, it is possible to draw a high-quality image with a lighter load than with normal processing such as ray tracing.

 画像生成部56は、複数種類の3Dシーン情報を用いて描画を行うことにより、1つのシーンを表す表示画像を時間的、空間的、あるいはその双方で変化させる。例えばコンテンツサーバ20が、情報の密度が低い順に3Dシーン情報のデータを送信する態様において、画像生成部56は、送信された3Dシーン情報を順次用いて表示画像を更新していく。これにより、1つのシーンを表す画像を低遅延で表示できるとともに、徐々に解像度が上がることにより視認上の画質を維持できる。出力部58は、画像生成部56が描画した画像を所定のレートで表示装置16に出力し表示させる。 The image generation unit 56 changes the display image representing one scene temporally, spatially, or both by drawing with multiple types of 3D scene information. For example, in a mode in which the content server 20 transmits 3D scene information data in ascending order of information density, the image generation unit 56 updates the display image using the transmitted 3D scene information in sequence. This allows an image representing one scene to be displayed with low latency, while the gradually increasing resolution maintains the perceived image quality. The output unit 58 outputs the image drawn by the image generation unit 56 to the display device 16 at a predetermined rate for display.
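The interplay between the 3D scene information storage unit 54 (keep the most recent data of each type, overwrite on arrival) and coarse-to-fine drawing can be illustrated with a small sketch. The class and method names below are hypothetical, chosen only for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class SceneInfoStore:
    """Keeps the most recently received data for each type of 3D scene information."""
    by_type: dict = field(default_factory=dict)

    def update(self, kind, data):
        # Newly arrived data overwrites older data of the same type.
        self.by_type[kind] = data

    def best_available(self, preference):
        """Return the most preferred type that has arrived so far, or None."""
        for kind in preference:
            if kind in self.by_type:
                return kind, self.by_type[kind]
        return None

store = SceneInfoStore()
store.update("low_density", "coarse network weights")
# A frame drawn now uses the low-density data with little latency;
# once the high-density data arrives, the next frame switches to it.
store.update("high_density", "fine network weights")
kind, _ = store.best_available(["high_density", "low_density"])
print(kind)
```

This captures only the selection logic; the actual drawing with the chosen scene information proceeds as in the volume rendering described earlier.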

 コンテンツサーバ20は、クライアント端末10から入力情報を取得する入力情報取得部70、学習用視点を生成する学習用視点生成部72、表示世界を制御する表示世界制御部74、オブジェクトの3次元モデルを記憶する3次元モデル記憶部76、学習用画像を生成する学習用画像生成部78、複数種類の3Dシーン情報のデータを取得する種類別3Dシーン情報取得部80、取得した3Dシーン情報のデータを格納する3Dシーン情報記憶部84、および、3Dシーン情報のデータをクライアント端末10へ送信する3Dシーン情報送信部86を備える。 The content server 20 includes an input information acquisition unit 70 that acquires input information from the client terminal 10, a learning viewpoint generation unit 72 that generates a learning viewpoint, a display world control unit 74 that controls the display world, a 3D model storage unit 76 that stores 3D models of objects, a learning image generation unit 78 that generates learning images, a type-specific 3D scene information acquisition unit 80 that acquires data on multiple types of 3D scene information, a 3D scene information storage unit 84 that stores the acquired data on the 3D scene information, and a 3D scene information transmission unit 86 that transmits the data on the 3D scene information to the client terminal 10.

 入力情報取得部70は、ユーザ操作の内容や視点の情報を、クライアント端末10から随時、あるいは所定の時間間隔で取得する。学習用視点生成部72は、学習用画像を生成するための学習用視点を複数生成する。学習用視点生成部72は、入力情報取得部70が取得した最新の表示用視点の周囲に、所定の規則で学習用視点を生成する。学習用視点生成部72は例えば、最新の表示用視点を中心とし所定半径の球の内部に均等に、所定数の学習用視点を配置する。ここで所定半径とは、視点の移動に想定される最高速度に、当該データを用いた表示がなされるまでの最長時間を乗算した値などとする。 The input information acquisition unit 70 acquires the contents of user operations and viewpoint information from the client terminal 10 at any time or at a specified time interval. The learning viewpoint generation unit 72 generates multiple learning viewpoints for generating learning images. The learning viewpoint generation unit 72 generates learning viewpoints around the latest display viewpoint acquired by the input information acquisition unit 70 according to a specified rule. For example, the learning viewpoint generation unit 72 places a specified number of learning viewpoints evenly inside a sphere of a specified radius centered on the latest display viewpoint. Here, the specified radius is a value obtained by multiplying the maximum speed expected for viewpoint movement by the longest time until display using the data is performed, for example.
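The placement of learning viewpoints inside a sphere around the latest display viewpoint, with radius equal to the maximum expected viewpoint speed multiplied by the worst-case time until the data is used for display, can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and the random sampling only approximates the even placement described above.

```python
import numpy as np

rng = np.random.default_rng(7)

def learning_viewpoints(display_pos, max_speed, max_latency, count):
    """Sample candidate learning viewpoints inside a sphere centered
    on the latest display viewpoint.

    Radius = maximum expected viewpoint speed x longest time until
    the learned data is used for display.
    """
    radius = max_speed * max_latency
    # Uniform directions: normalize Gaussian samples.
    dirs = rng.normal(size=(count, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Uniform density over the ball requires the cube-root radius scaling.
    radii = radius * rng.random(count) ** (1.0 / 3.0)
    return display_pos + dirs * radii[:, None]

center = np.array([0.0, 1.5, 0.0])
pts = learning_viewpoints(center, max_speed=2.0, max_latency=0.5, count=8)
dists = np.linalg.norm(pts - center, axis=1)
print((dists <= 1.0).all())  # all within the 2.0 * 0.5 = 1.0 radius
```

Biasing more viewpoints toward the expected direction of motion, as the text suggests, would amount to replacing the uniform distributions with weighted ones.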

 The learning viewpoint generation unit 72 is not limited to distributing the learning viewpoints evenly; it may place more learning viewpoints in the range into which the viewpoint is expected to move, depending on the state of the display world and so on. The learning viewpoint generation unit 72 may also set lines of sight evenly in a predetermined number of directions for the viewpoint at each position, or may set more lines of sight in the directions in which the line of sight is expected to move. The learning viewpoint generation unit 72 may also use the latest display viewpoint itself as a learning viewpoint.

 The display world control unit 74 controls the three-dimensional display world represented as content according to the contents of the user operations acquired by the input information acquisition unit 70. For example, if the content is an electronic game, the display world control unit 74 places necessary objects, such as user characters, in the virtual space in which the game takes place, and gives them movement according to user-entered commands and the rules of the program.

 The three-dimensional model storage unit 76 stores three-dimensional models of the objects that exist in the display world; the display world control unit 74 reads them out as appropriate and uses them to construct the display world. The learning image generation unit 78 generates, as learning images, images of the scene viewed from the multiple learning viewpoints generated by the learning viewpoint generation unit 72. The learning image generation unit 78 preferably generates the learning images using a technique capable of rendering high-quality images, such as ray tracing.

 The type-specific 3D scene information acquisition unit 80 uses the learning images generated by the learning image generation unit 78 to generate 3D scene information through machine learning as described above, and updates it to track changes in the scene. Here, the type-specific 3D scene information acquisition unit 80 acquires multiple types of 3D scene information representing a single scene. For example, it acquires multiple pieces of 3D scene information whose spatial densities of represented information differ from one another. Hereinafter, the spatial density of the information held by 3D scene information is called the "information density." Information density may also be thought of as the resolution of the information held by the 3D scene information, or as its spatial frequency.

 For example, according to the NeRF literature cited above, the vectors input to the neural network during training are mapped, by positional encoding, to vectors in a high-dimensional space containing high frequencies, which allows the high-frequency components of the output to be represented more accurately. The function γ used for the conversion is expressed as follows:

γ(p) = ( sin(2^0 πp), cos(2^0 πp), ..., sin(2^(L-1) πp), cos(2^(L-1) πp) )

 In the above expression, the larger the parameter L, the more detailed the 3D scene information that can be acquired, including higher frequency components. Exploiting this, the type-specific 3D scene information acquisition unit 80 sets multiple values of the parameter L and performs machine learning separately for each, thereby acquiring in parallel multiple pieces of 3D scene information with different information densities. However, the means of controlling the information density of 3D scene information is not limited to this. For example, instead of positional encoding, multiresolution hash encoding, which sets grids of multiple resolutions and expresses the input vector based on its positional relationship to their vertices, may be applied (see, for example, Thomas Muller et al., "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding," ACM Transactions on Graphics, July 2022, Vol. 41, No. 4, Article 102, pp. 1-15).
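 As an illustration of how the parameter L controls the frequency content, the positional encoding above can be sketched as follows. This is a minimal sketch based on the cited NeRF formulation; the function name and the NumPy vectorization are assumptions for illustration.

```python
import numpy as np

def positional_encoding(p, L):
    """Map each input coordinate to 2*L sinusoids of increasing frequency.

    A larger L adds higher-frequency components, corresponding to 3D scene
    information with higher information density (and longer learning time).
    """
    p = np.atleast_1d(np.asarray(p, dtype=float))
    features = []
    for i in range(L):  # frequencies 2^0 * pi, 2^1 * pi, ..., 2^(L-1) * pi
        features.append(np.sin((2.0 ** i) * np.pi * p))
        features.append(np.cos((2.0 ** i) * np.pi * p))
    return np.concatenate(features)
```

For instance, with L = 10 a three-component position expands to a 60-dimensional vector; preparing several values of L yields encodings, and hence learned scene representations, of different information densities.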

 In this case, the type-specific 3D scene information acquisition unit 80 can acquire in parallel multiple pieces of 3D scene information with different information densities by setting multiple values of the level count L, which corresponds to the number of grid resolutions, and performing machine learning separately for each. In these cases, multiple values to be set as the parameter L are prepared in the internal memory of the type-specific 3D scene information acquisition unit 80. The type-specific 3D scene information acquisition unit 80 then sequentially uses the multiple learning images generated for one scene by the learning image generation unit 78 to acquire, in parallel, multiple pieces of 3D scene information with different information densities, storing them in the 3D scene information storage unit 84 or updating already-stored 3D scene information of the same type.

 In this case, even for the same scene, the learning speed differs depending on the information density; the lower the information density, the sooner learning is completed, and therefore the sooner the 3D scene information is updated. When performing machine learning with grids, the type-specific 3D scene information acquisition unit 80 may instead set only a single level count L for learning and, at the time of transmission to the client terminal 10 or the like, individually select grids of different levels (for example, the grids of levels 0 to 1, levels 0 to 2, ..., levels 0 to L-1) and read out their data. In this case too, lower-level grids, that is, grids with lower information density, complete learning sooner, and the effects described later are likewise obtained.

 The types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 are not limited to distinctions in information density. For example, the type-specific 3D scene information acquisition unit 80 may acquire multiple pieces of 3D scene information whose learning targets are different regions or objects in the display world. In this case, the smaller the range of the display world targeted for learning, the sooner learning is completed, and therefore the sooner the 3D scene information is updated.

 Thus, even for the same scene, the multiple types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 may differ in the time required to complete an update, depending on the density of the information and the size of the range they represent. The type-specific 3D scene information acquisition unit 80 therefore records, in association with each of the multiple types of 3D scene information stored in the 3D scene information storage unit 84, the time of the scene it reflects. Note that the type-specific 3D scene information acquisition unit 80 may acquire multiple types of 3D scene information with different combinations of information density and learning-target range.

 The 3D scene information transmission unit 86 transmits the latest 3D scene information data stored in the 3D scene information storage unit 84 to the client terminal 10. For a given scene, the 3D scene information transmission unit 86 transmits the pieces of 3D scene information to the client terminal 10 in the order in which their learning is completed. For example, when data of 3D scene information with different information densities is to be transmitted, learning is completed in order of increasing information density, as described above. The 3D scene information transmission unit 86 therefore transmits the data of the lowest-density 3D scene information as soon as its learning is completed, then the data of the next lowest density as soon as its learning is completed, and so on, progressively transmitting up to the 3D scene information with the highest information density.

 The 3D scene information transmission unit 86 internally includes a division unit 88. The division unit 88 divides each of the multiple types of 3D scene information representing one scene into multiple pieces of data. The 3D scene information representing one scene is constituted by a separate neural network for each type, and the division unit 88 divides each neural network into multiple neural networks. Here, division means removing mutually different subsets of nodes while maintaining the structure of the nodes associated via hash tables or the like. The nodes to be removed from each divided neural network are determined randomly.
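 The division just described can be sketched as follows, treating a network simply as a mapping from node identifiers to parameters. This is a hypothetical illustration: the function names, the dictionary representation, and the seeded random generator are assumptions, not part of the embodiment.

```python
import random

def split_network(nodes, num_parts, seed=None):
    """Divide a network (node_id -> parameters) into num_parts copies, each
    with a mutually different, randomly chosen subset of nodes removed.

    Each node is dropped from exactly one part and kept in all the others,
    so merging every part restores the original network exactly.
    """
    rng = random.Random(seed)
    drop_group = {nid: rng.randrange(num_parts) for nid in nodes}
    return [
        {nid: params for nid, params in nodes.items() if drop_group[nid] != j}
        for j in range(num_parts)
    ]

def merge_parts(parts):
    """Client-side reconstruction: take the union of whichever parts arrived."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

If one part is lost to packet loss, merging the remaining parts still yields a network missing only the nodes unique to the lost part, which, as with dropout, can still be used for drawing.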

 A technique called dropout is known in which some of the nodes of a neural network are randomly deactivated to mitigate overfitting (see, for example, Nitish Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, June 2014, Vol. 15, pp. 1929-1958). As dropout makes clear, a neural network maintains the accuracy of its learning results above a certain level even when some of its nodes are deactivated.

 Exploiting this property, the client terminal 10 can generate a display image with relatively high accuracy even using 3D scene information from which some nodes have been removed. By having the content server 20 transmit multiple neural networks representing the same 3D scene information, each with some nodes removed, robustness against packet loss can be increased while suppressing growth in data size. If there is no packet loss, the image generation unit 56 of the client terminal 10 can restore the pre-division neural network and draw the display image. If there is packet loss, the image generation unit 56 draws the display image using whichever neural networks it was able to acquire; as noted above, generating the display image remains possible in this case as well.

 Also, as noted above, multiple types of 3D scene information representing a single scene may arrive from the content server 20 at different times. In this case, the image generation unit 56 of the client terminal 10 updates the display image using the newly acquired 3D scene information. Since the resolution of the display image corresponds to the information density of the 3D scene information, acquiring the 3D scene information in order of increasing information density causes the resolution of the display image to rise gradually. Moreover, 3D scene information with low information density requires fewer samples in volume rendering, so the display image is drawn faster.

 As a result, by first acquiring low-information-density 3D scene information, whose learning time on the content server 20 is short, and quickly drawing a low-resolution image, the time from the start of learning each scene to its display can be shortened dramatically. Ultimately, a high-definition image can be displayed using high-information-density 3D scene information, so low-latency display is achieved while minimizing the impact on appearance. As noted above, the types of 3D scene information are not limited to distinctions in resolution. For example, the content server 20 may learn only the partial region the user is gazing at in a short time and transmit it first, transmitting the 3D scene information for the whole scene later. In this case too, by the same principle, an image representation becomes possible in which the gazed-at region is displayed with low latency and the surrounding image is updated afterward.

 The 3D scene information transmission unit 86 may transmit the 3D scene information representing one scene to the client terminal 10 using different communication protocols depending on its type. For example, the 3D scene information transmission unit 86 may transmit the 3D scene information for regions where image detail should take priority using the highly reliable TCP/IP (Transmission Control Protocol/Internet Protocol), and transmit the 3D scene information for regions where low-latency motion should take priority using UDP (User Datagram Protocol), which has a higher transfer rate. The correspondence between types of 3D scene information and the communication protocols suited to them is set in advance in, for example, the internal memory of the 3D scene information transmission unit 86.

 FIG. 5 schematically shows the procedure by which the content server 20 acquires 3D scene information. First, the display world control unit 74 constructs a display world 210 in which, for example, an enemy character 212 exists. As noted above, the display world 210 may contain motion, but here the display world 210 corresponding to a single scene is shown. The learning viewpoint generation unit 72 then generates multiple learning viewpoints based on the latest display viewpoint and the like, and the learning image generation unit 78 generates learning images 214a, 214b, 214c, and so on, corresponding to the respective learning viewpoints. The number of learning images generated is not limited.

 The type-specific 3D scene information acquisition unit 80 performs machine learning using the learning images 214a, 214b, 214c, and so on, and updates multiple types of 3D scene information. The substance of each piece of 3D scene information is a neural network 216a, 216b, 216c, .... The neural networks 216a, 216b, 216c, ... differ in at least one of information density and represented range. When varying the information density, the type-specific 3D scene information acquisition unit 80 performs learning by, for example, setting a different value of the above-mentioned parameter L for each. In this case, the lower the information density of a piece of 3D scene information, the sooner its learning is completed.

 When varying the represented range, the type-specific 3D scene information acquisition unit 80 performs learning by, for example, cutting out the corresponding regions from the learning images 214a, 214b, 214c. In this case, the narrower the range represented by a piece of 3D scene information, the sooner its learning is completed. When information density and represented range are varied in combination, the time to complete learning varies with their balance. Qualitatively, the higher the information density and the wider the represented range, the later learning is completed, so the balance between information density and breadth of range is optimized in advance to obtain an appropriate delay time. By repeating the illustrated processing at a predetermined rate, the neural networks 216a, 216b, 216c, ... are each updated in accordance with the transitions of the scene.

 FIG. 6 is a diagram for explaining a mode of acquiring 3D scene information of different ranges. Suppose that part of the scene of the display world 210 shown in FIG. 5 is represented as a display image 222. The type-specific 3D scene information acquisition unit 80, for example, separately acquires 3D scene information representing a range 226 in the display world 210 that corresponds to an important region 224 in the display image 222, and 3D scene information representing the whole scene, including what lies outside that range. Here, the important region 224 is, for example, a region within a predetermined range of the user's gaze point, a region within a predetermined range of the center of the display image, a region in which the battle situation or acquired items are shown, or a region in which main objects such as enemy characters or user characters exist; the rules for selecting the region are set in advance according to the nature of the content and so on.

 Alternatively, the type-specific 3D scene information acquisition unit 80 may directly identify a main object itself that exists in the display world 210, or a range of a predetermined size containing that object, and acquire 3D scene information with it as the learning target. When the range to be learned is determined based on the user's gaze point, a well-known gaze point detector is provided in the client terminal 10. The input information acquisition unit 70 of the content server 20 then acquires gaze point information from the client terminal 10 at a predetermined rate and determines the range to be learned individually.

 The type-specific 3D scene information acquisition unit 80 may also separately learn the range of the display world corresponding to the display image 222 being displayed on the client terminal 10 and a wider range including what lies outside it. In any case, the type-specific 3D scene information acquisition unit 80 performs machine learning by cutting out the corresponding partial regions from the multiple learning images generated by the learning image generation unit 78, while also performing machine learning over the wider range by, for example, using the learning images in their entirety.

 The variations of range represented by the 3D scene information are not limited to two and may be three or more. The inclusion relationships among the ranges are also not limited; 3D scene information may be acquired for each of multiple independent ranges. In any case, as many pieces of 3D scene information are acquired as there are ranges set. The narrower the range, the shorter the learning time and drawing time, enabling display with lower latency. Exploiting this property, even if the information density of the 3D scene information corresponding to the important region 224 is raised to some extent, drawing can be completed, provided the range is narrow, in a time comparable to drawing a wide-area image using low-information-density 3D scene information. As a result, it also becomes possible to display the important region 224 with high definition and low latency.

 FIG. 7 schematically shows how the data changes as 3D scene information is transmitted and received. Time passes from the top of the figure toward the bottom; (a) through (c) show the state transitions of the data in the content server 20, and (c) through (e) show the state transitions of the data in the client terminal 10. The figure concerns the 3D scene information representing a single scene. First, as shown in (a), the type-specific 3D scene information acquisition unit 80 of the content server 20 acquires, based on newly generated learning images, multiple neural networks 230a, 230b, ... corresponding respectively to multiple types of 3D scene information.

 Next, as shown in (b), the 3D scene information transmission unit 86 of the content server 20 divides each of the neural networks 230a and 230b. That is, the 3D scene information transmission unit 86 generates, from the neural network 230a, multiple neural networks 232a and 232b from each of which a mutually different subset of nodes has been removed. Likewise, the 3D scene information transmission unit 86 generates, from the neural network 230b, multiple neural networks 234a and 234b from each of which a mutually different subset of nodes has been removed.

 In the figure, among the divided neural networks 232a, 232b, 234a, and 234b, the nodes removed from the original neural networks 230a and 230b are shown with dotted lines. If, in the neural networks 232a and 232b obtained by dividing a given neural network 230a, the nodes removed from one neural network 232a are retained in the other neural network 232b, then combining the two completely restores the original neural network 230a.

 However, as noted above, even if some nodes are missing due to packet loss or the like, an image can be generated from the remaining neural networks. The number of divisions of a neural network is not limited to two. The division unit 88 of the 3D scene information transmission unit 86 preferably determines the nodes to be removed randomly, in the same manner as dropout. The 3D scene information transmission unit 86 packetizes the divided neural networks 232a, 232b, 234a, and 234b and transmits them to the client terminal 10.

 At this time, as shown in (c), the 3D scene information transmission unit 86 randomly interleaves the transmission order of the packets of the multiple neural networks. In the figure, transmission in the order of neural networks 232b, 234b, 232a, ... is shown as a horizontal arrangement. Interleaving the transmission order reduces the possibility that all 3D scene information of a given type is lost to a run of consecutive packet losses. The interleaving is performed, however, only among neural networks updated in parallel within a permitted predetermined time, so that it does not delay the drawing processing more than necessary.
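 The random interleaving of packets from networks updated within the same window can be sketched as follows; the function name and the flat (network id, packet) representation are assumptions for illustration.

```python
import random

def interleave_packets(packet_groups, seed=None):
    """Randomly interleave packets from several networks updated in parallel.

    packet_groups is a list of packet lists, one per divided neural network.
    Shuffling the combined order means a burst of consecutive packet losses
    is unlikely to remove every packet of one type of 3D scene information.
    """
    rng = random.Random(seed)
    flat = [(net_id, pkt)
            for net_id, group in enumerate(packet_groups)
            for pkt in group]
    rng.shuffle(flat)
    return flat
```

Tagging each packet with its network id plays the role of the metadata described below it: the receiver can regroup packets by id regardless of arrival order.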

 When the 3D scene information acquisition unit 52 of the client terminal 10 acquires the packets in sequence, it reconstructs the pre-division neural networks by restoring the extracted neural networks to their original order, as shown in (d). To this end, the content server 20 attaches to each of the transmitted neural networks 232b, 234b, 232a, ... metadata indicating which of the original neural networks 230a, 230b, ... it was divided from.

 In the illustrated example, the 3D scene information acquisition unit 52 was able to acquire the neural networks 232a and 232b, and can therefore completely restore the original neural network 230a. For the original neural network 230b, on the other hand, one of the divided neural networks, 234a, could not be acquired due to packet loss, as indicated by the dotted frame 236; in this case, the original neural network 230b cannot be completely restored. In either case, the image generation unit 56 of the client terminal 10 uses the neural networks it was able to acquire, 232a, 232b, and 234b, to draw a display image 238 as shown in (e).

 Specifically, the image generation unit 56 draws the image of the scene viewed from the latest display viewpoint by volume rendering using the neural networks 232a, 232b, and 234b. When multiple types of 3D scene information are transmitted at different times, the image generation unit 56 updates at least part of the display image 238 using the latest 3D scene information. This realizes a display in which, for example, the resolution of the image rises gradually, or important parts of the image move with particularly low latency.

 FIG. 8 is a diagram for explaining the temporal relationship between machine learning in the content server 20 and image drawing in the client terminal 10. Here, as an example, assume a case in which multiple types of 3D scene information with different information densities are transmitted. The horizontal direction of the figure is the time axis; the learning times in the content server 20 and the drawing times of the individual frames of the display image in the client terminal 10 are each shown as rectangles. The number attached to each drawing-time rectangle indicates the order of the frame.

 As shown in the upper part, the type-specific 3D scene information acquisition unit 80 of the content server 20 learns the first, second, ..., and n-th types of 3D scene information individually. Here, the ordinal corresponds to the information density: the first type has the lowest information density and the n-th type the highest. As illustrated, even if the type-specific 3D scene information acquisition unit 80 starts learning the multiple types of 3D scene information simultaneously, the higher the information density, the longer learning takes to complete. The content server 20 first transmits the first type of 3D scene information to the client terminal 10 as soon as its learning is completed, as indicated by arrow A1.

 クライアント端末10は、第1種の3Dシーン情報を用いて、時刻t1から、番号0、1、2のフレームの画像を最低解像度で描画していく。一方、コンテンツサーバ20は、第2種の3Dシーン情報の学習が完了した時点で、矢印A2に示すように、クライアント端末10にそれを送信する。クライアント端末10は、第2種の3Dシーン情報を取得した時刻t2の直後に描画を開始する、番号3のフレームから、第2種の情報密度に対応する解像度で画像を描画する。同様の処理を繰り返すことで、フレームの解像度が徐々に増加していく。 Using the first type of 3D scene information, the client terminal 10 draws the images of frames 0, 1, and 2 at the lowest resolution, starting at time t1. Meanwhile, when the content server 20 has completed learning the second type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow A2. Starting with frame number 3, whose drawing begins immediately after time t2, when the second type of 3D scene information is acquired, the client terminal 10 draws images at the resolution corresponding to the second type's information density. By repeating this process, the frame resolution rises step by step.

 コンテンツサーバ20が第n種の3Dシーン情報の学習を完了したら、矢印Anに示すようにクライアント端末10に送信する。クライアント端末10は、当該3Dシーン情報を取得した時刻tnの直後に描画を開始する、番号m+2のフレームから、第n種の情報密度に対応する最高解像度で画像を描画する。本実施の形態ではこのように、同じシーンを表す3Dシーン情報を複数種類学習し、早くに学習が完了する3Dシーン情報を即時にクライアント端末10に送信する。これにより、例えば第n種など情報密度の高い3Dシーン情報のみを送信する場合と比較し、フレームの描画開始時刻を格段に早めることができる。 Once the content server 20 has completed learning the nth type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow An. Starting with frame number m+2, whose drawing begins immediately after time tn, when that 3D scene information is acquired, the client terminal 10 draws images at the highest resolution, corresponding to the nth type's information density. In this embodiment, multiple types of 3D scene information representing the same scene are thus learned, and whichever type finishes learning earliest is transmitted to the client terminal 10 immediately. This makes the frame drawing start time far earlier than when only high-information-density 3D scene information, such as the nth type, is transmitted.
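The timing relationship of Fig. 8 can be paraphrased numerically. In the following hypothetical sketch, the time values, the fixed frame period, and the function names are all illustrative assumptions: each density type is transmitted the moment its learning completes, and the first frame that can use it is the first one whose drawing starts after arrival.

```python
import math

def transmission_order(learning_times):
    """Types are sent in ascending order of learning-completion time,
    i.e. lowest information density first."""
    return [k for _, k in sorted((t, k) for k, t in learning_times.items())]

def first_usable_frame(learning_time, transfer_time, frame_period):
    """Index of the first frame whose drawing starts at or after the
    moment the 3D scene information arrives at the client terminal."""
    arrival = learning_time + transfer_time  # e.g. Tb + Ta for the 1st type
    return math.ceil(arrival / frame_period)

# Three density types start learning simultaneously at t = 0 (times in ms).
times = {1: 30.0, 2: 90.0, 3: 300.0}
order = transmission_order(times)              # lowest density is sent first
f1 = first_usable_frame(times[1], 10.0, 16.7)  # first frame drawn with type 1
f3 = first_usable_frame(times[3], 10.0, 16.7)  # first frame drawn with type 3
```

With these assumed numbers, type 1 becomes usable many frames before type 3, which is exactly the head start that sending the low-density representation early provides.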

 すなわちコンテンツサーバ20において学習が開始される時刻を基準とすると、データ伝送のための時間Taと、第1種の3Dシーン情報の学習に要する時間Tbのみの遅延で描画を開始できる。なお実際には、時刻t1の前には、直前に送信された前のシーンの3Dシーン情報を用いたフレームの描画が行われてよい。また画像生成部56は、各フレームの描画時、直前に取得した最新の表示用視点に基づきボリュームレンダリングを行う。これにより、フレーム番号0からm+2にかけて、単に解像度が増加するのみならず、視点の動きも低遅延で反映させた画像を表示できる。ここで第1種など3Dシーン情報の情報密度が低いほど、ボリュームレンダリングにおけるサンプリング数が少なく描画速度が高くなるため、より少ない遅延での表示が可能になる。 In other words, taking the time at which learning starts in the content server 20 as a reference, drawing can begin after a delay of only the data transmission time Ta plus the time Tb required to learn the first type of 3D scene information. In practice, before time t1, frames may be drawn using the 3D scene information of the previous scene, transmitted just beforehand. When drawing each frame, the image generation unit 56 performs volume rendering based on the latest display viewpoint, acquired immediately before. As a result, over frames 0 through m+2, the displayed images not only rise in resolution but also reflect viewpoint movement with low latency. The lower the information density of the 3D scene information, as with the first type, the fewer samples volume rendering requires and the faster the drawing, so display with even less delay becomes possible.

 このように画像生成部56は、表示済みか否かに関わらず、一旦、生成された画像を、最新の表示用視点に合うように補正する機能も有しているといえる。図9は、クライアント端末10の画像生成部56による画像補正処理を説明するための図である。画像250aは表示画像のフレームまたはその一部であり、前景として円筒形のオブジェクトの像252aが表されている。視点の動きにより相対的に円筒形のオブジェクトの位置が変化し、Δtの時間経過後の画像250bにおいてオブジェクトの像252bがずれると、画像250aでは隠蔽されていた背景の領域254を新たに描画する必要が生じる。 In this way, image generation unit 56 can be said to have the function of correcting an image that has already been generated so that it matches the latest display viewpoint, regardless of whether it has already been displayed. Figure 9 is a diagram for explaining the image correction process by image generation unit 56 of client terminal 10. Image 250a is a frame or part of a display image, and shows image 252a of a cylindrical object in the foreground. If the position of the cylindrical object changes relatively due to movement of the viewpoint, and object image 252b shifts in image 250b after the lapse of time Δt, it becomes necessary to newly draw background area 254 that was hidden in image 250a.

 3Dシーン情報を用いず、コンテンツサーバ20から送信された画像を表示する従来技術では、元の画像250aでは表されていない領域254を新たに作り出すことが難しい。オブジェクトとその周囲の領域の情報を含む3Dシーン情報を用いれば、最新の視点に対応するように、画像250bの全ての画素を決定できるため、当然、領域254の描画も可能になる。また表示用視点が動き、オブジェクトおよび光源との位置関係が変化すれば、オブジェクトの像252bの色味も変化し得る。これについても3Dシーン情報を用いて、新たな表示用視点から見た像を描画することにより、色味の変化を正確に表現できる。これにより、Δtの時間経過後の画像250cを精度よく生成できる。 In conventional technology that displays images sent from the content server 20 without using 3D scene information, it is difficult to create a new area 254 that is not shown in the original image 250a. If 3D scene information containing information on the object and its surrounding area is used, all pixels of image 250b can be determined to correspond to the latest viewpoint, so it is naturally possible to draw area 254. Furthermore, if the display viewpoint moves and the positional relationship with the object and light source changes, the color of the image 252b of the object may also change. In this case, too, the change in color can be accurately expressed by using the 3D scene information to draw an image seen from a new display viewpoint. This makes it possible to generate image 250c after a time Δt has passed with high accuracy.
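A minimal sketch of the correction illustrated in Fig. 9 follows. The one-dimensional parallax shift, the pixel values, and the `redraw_pixel` stub are all illustrative assumptions: the previous frame is warped by the viewpoint motion, and the disoccluded pixels for which no source pixel exists, like region 254, are filled by re-rendering from the 3D scene information.

```python
def correct_frame(old_frame, shift, redraw_pixel):
    """Warp old_frame horizontally by `shift` pixels, then fill the
    exposed (disoccluded) pixels via redraw_pixel(x, y), which stands
    in for re-rendering from the latest 3D scene information."""
    h, w = len(old_frame), len(old_frame[0])
    HOLE = object()  # marker for pixels with no warped source
    new_frame = [[HOLE] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nx = x + shift
            if 0 <= nx < w:
                new_frame[y][nx] = old_frame[y][x]
    for y in range(h):
        for x in range(w):
            if new_frame[y][x] is HOLE:
                # Region exposed by the viewpoint motion: draw it anew.
                new_frame[y][x] = redraw_pixel(x, y)
    return new_frame

# A 1x4 frame; shifting right by one pixel exposes the leftmost column.
frame = correct_frame([["a", "b", "c", "d"]], 1, lambda x, y: "bg")
```

In a real renderer the warp would be a full 3D reprojection and `redraw_pixel` would be volume rendering from the latest viewpoint, which also captures the color changes described above; the hole-filling structure, however, stays the same.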

 なおこれまで述べたように、クライアント端末10が3Dシーン情報を用いて、表示画像の各フレームを描画することを前提とすると、画像250a、250cはそれぞれ、その時点での最新の3Dシーン情報を用いてクライアント端末10が描画した画像である。一方、コンテンツサーバ20から、3Dシーン情報とともに表示画像も送信する態様を想定し、コンテンツサーバ20から送信された画像250aを、クライアント端末10が3Dシーン情報を用いて補正することも可能である。 As described above, assuming that the client terminal 10 uses 3D scene information to draw each frame of the display image, images 250a and 250c are images that the client terminal 10 has drawn using the latest 3D scene information at that time. On the other hand, assuming a situation in which the content server 20 transmits the display image together with the 3D scene information, it is also possible for the client terminal 10 to correct image 250a transmitted from the content server 20 using the 3D scene information.

 例えばクライアント端末10は、コンテンツサーバ20側で画像250aを生成した時点から、それをクライアント端末10で表示するまでの時間差Δtの間に生じた視点の動きを反映するように補正した画像250cを生成する。この場合もクライアント端末10の画像生成部56は、コンテンツサーバ20が生成した画像250aでは表されていない領域254や、色味が変化した像252bを、最新の3Dシーン情報を用いて描画し直す。 For example, the client terminal 10 generates an image 250c that has been corrected to reflect the movement of the viewpoint that occurs during the time difference Δt between when the image 250a is generated on the content server 20 side and when it is displayed on the client terminal 10. In this case as well, the image generating unit 56 of the client terminal 10 redraws the area 254 that is not shown in the image 250a generated by the content server 20 and the image 252b whose color has changed, using the latest 3D scene information.

 この場合、コンテンツサーバ20において生成された高品質な画像250aと、クライアント端末10で新たに描画した領域とを合成することになるため、画像生成部56は、情報密度の高い3Dシーン情報を用いて必要な領域を描画することが望ましい。なお画像生成部56は、コンテンツサーバ20から送信された画像250aにおける像を移動させたり変形させたりするのみでも違和感が少ない状況においては、3Dシーン情報を用いた新たな描画処理を省略できる。例えば表示用視点の変化に対し像のずれ量や色味の変化が小さい、遠方にあるオブジェクトについては、画像生成部56は、画像250aに直接加工を施すなどして補正してもよい。 In this case, since the high-quality image 250a generated by the content server 20 is to be combined with the area newly drawn by the client terminal 10, it is desirable for the image generation unit 56 to draw the necessary area using 3D scene information with high information density. Note that the image generation unit 56 can omit new drawing processing using 3D scene information in a situation where there is little sense of incongruity simply by moving or deforming the image in the image 250a transmitted from the content server 20. For example, for distant objects where the amount of image shift or color change is small in response to changes in the display viewpoint, the image generation unit 56 may perform correction by directly processing the image 250a.

 そのため画像生成部56にはあらかじめ、3Dシーン情報を用いて描画し直す必要がある領域を判定する規則を、内部のメモリなどに格納しておく。例えば画像生成部56は、表示用視点の速度がしきい値を超えたとき、または超えると予測されるとき、視点からの距離がしきい値以下のオブジェクトとその周囲の領域について、3Dシーン情報を用いて描画し直してもよい。この場合、コンテンツサーバ20は、表示世界のジオメトリ情報もクライアント端末10に送信する。これにより画像生成部56は、表示用視点とオブジェクトとの距離の変化を取得できる。 For this reason, the image generation unit 56 stores in advance in an internal memory or the like rules for determining areas that need to be redrawn using 3D scene information. For example, when the speed of the display viewpoint exceeds a threshold value, or is predicted to exceed it, the image generation unit 56 may use 3D scene information to redraw objects whose distance from the viewpoint is less than or equal to the threshold value, and their surrounding areas. In this case, the content server 20 also transmits geometry information of the display world to the client terminal 10. This allows the image generation unit 56 to obtain changes in the distance between the display viewpoint and the object.
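The stored rule could be as simple as the following sketch. The threshold values and the function name are illustrative assumptions; the text above only requires that some such rule be held in internal memory.

```python
def needs_redraw(viewpoint_speed, object_distance,
                 speed_threshold=0.5, distance_threshold=10.0):
    """Re-render an object and its surroundings from the 3D scene
    information only when the display viewpoint moves (or is predicted
    to move) faster than the threshold AND the object lies within the
    distance threshold; otherwise a 2D warp of the server image is
    assumed to look acceptable. Thresholds are illustrative."""
    return (viewpoint_speed > speed_threshold
            and object_distance <= distance_threshold)
```

Distant objects, whose image shift and color change are small, fail the distance test and are handled by directly processing the transmitted image, as described above; the geometry information sent by the server supplies the `object_distance` input.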

 以上述べた本実施の形態によれば、コンテンツサーバ20は、コンテンツの表示世界を表す学習用画像を生成し、各シーンの3Dシーン情報を複数種類、取得する。コンテンツサーバ20は学習が完了した3Dシーン情報を順次、クライアント端末10に送信し、クライアント端末10はそれを用いて、最新の表示用視点に対応する表示画像のフレームを描画する。3Dシーン情報の種類として、情報の密度や表す範囲の広さを様々に設定することにより、学習時間が変化し、ひいては表示までの遅延時間も制御できる。さらに部分的に詳細度を高めることもできるため、画像上の重要性などを加味した、低遅延で高品質な画像を表示できる。 According to the present embodiment described above, the content server 20 generates learning images that represent the display world of the content, and acquires multiple types of 3D scene information for each scene. The content server 20 sequentially transmits the 3D scene information for which learning has been completed to the client terminal 10, and the client terminal 10 uses it to draw frames of the display image corresponding to the latest display viewpoint. By setting various types of 3D scene information, such as the density of information and the width of the range to be represented, the learning time can be changed, and the delay time until display can also be controlled. Furthermore, the level of detail can be increased in some areas, making it possible to display high-quality images with low latency that take into account the importance of the image, etc.

 またコンテンツサーバ20は、1つの3Dシーン情報を構成するニューラルネットワークを、複数のニューラルネットワークに分割してクライアント端末10に送信する。これにより、データサイズの増大を抑えつつ、パケットロスに対する頑健性を高められる。以上のことから、コンテンツサーバ20からの配信を伴う画像処理において、配信が介在することによる影響を軽減でき、ユーザ体験の質を高めることができる。 The content server 20 also divides the neural network that constitutes one piece of 3D scene information into multiple neural networks and transmits them to the client terminal 10. This makes it possible to improve robustness against packet loss while suppressing increases in data size. As a result, in image processing involving distribution from the content server 20, the impact of distribution can be reduced, improving the quality of the user experience.
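The division can be sketched as follows. The modulo partition and the dictionary representation of per-node weights are assumptions for illustration: each transmitted network omits a different subset of nodes, so with three or more parts every node survives the loss of any single part, and the client reconstructs the original by merging whatever arrives.

```python
def split_network(node_weights, n_parts):
    """Produce n_parts copies of one network's node weights, part i
    omitting the nodes whose index is congruent to i mod n_parts.
    Each node therefore appears in n_parts - 1 of the parts."""
    return [{node: w for node, w in node_weights.items()
             if node % n_parts != i}
            for i in range(n_parts)]

def reconstruct(parts):
    """Merge the surviving parts; any copy of a node restores it."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

weights = {i: i / 10 for i in range(6)}   # toy per-node weights
parts = split_network(weights, 3)
recovered = reconstruct(parts[1:])        # part 0 lost in transit
```

Because the excluded subsets differ between parts, the data overhead is the chosen redundancy factor rather than a full duplicate per packet, which is the trade-off between data size and packet-loss robustness described above.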

 以上、本発明を実施の形態をもとに説明した。実施の形態は例示であり、それらの各構成要素や各処理プロセスの組合せにいろいろな変形例が可能なこと、またそうした変形例も本発明の範囲にあることは当業者に理解されるところである。 The present invention has been described above based on an embodiment. The embodiment is merely an example, and it will be understood by those skilled in the art that various modifications are possible in the combination of each component and each processing process, and that such modifications are also within the scope of the present invention.

 以上のように本発明は、コンテンツサーバ、ゲーム装置、ヘッドマウントディスプレイ、表示装置、携帯端末、パーソナルコンピュータなど各種情報処理装置や、それらのいずれかを含む画像表示システムなどに利用可能である。 As described above, the present invention can be used in various information processing devices such as content servers, game devices, head-mounted displays, display devices, mobile terminals, and personal computers, as well as image display systems that include any of these.

 本開示は、以下の態様を含んでよい。
[項目1]
 コンテンツサーバであって、以下のように構成された回路(circuitry configured to)を備え、
 前記回路(circuitry)は、
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する、
 コンテンツサーバ。
[項目2]
 前記回路(circuitry)は、
 空間的な情報の密度が異なる、複数の前記3Dシーン情報を取得する、項目1に記載のコンテンツサーバ。
[項目3]
 前記回路(circuitry)は、
 1つの前記3Dシーン情報を構成するニューラルネットワークを、互いに異なる一部のノードが除外された複数のニューラルネットワークに分割したうえ、前記クライアント端末に送信する、項目1に記載のコンテンツサーバ。
[項目4]
 前記回路(circuitry)は、分割後の前記ニューラルネットワークをパケット化するとともに、その送信順をランダムに入れ替える、項目3に記載のコンテンツサーバ。
[項目5]
 前記回路(circuitry)は、前記表示世界における異なる範囲を表す、複数の前記3Dシーン情報を取得する、項目1に記載のコンテンツサーバ。
[項目6]
 前記回路(circuitry)は、前記クライアント端末において表示されている画像において定めた領域に対応する、前記表示世界の範囲を表す前記3Dシーン情報を取得する、項目5に記載のコンテンツサーバ。
[項目7]
 前記回路(circuitry)は、前記表示世界に存在するオブジェクトを対象とする3Dシーン情報を取得する、項目5に記載のコンテンツサーバ。
[項目8]
 前記回路(circuitry)は、前記複数種類の3Dシーン情報のデータを、異なる通信プロトコルで前記クライアント端末に送信する、項目1に記載のコンテンツサーバ。
[項目9]
 クライアント端末であって、以下のように構成された回路(circuitry configured to)を備え、
 前記回路(circuitry)は、
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得し、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 クライアント端末。
[項目10]
 前記回路(circuitry)は、1つの前記3Dシーン情報を構成するニューラルネットワークを分割してなる、互いに異なる一部のノードが除外された複数のニューラルネットワークを取得し、分割前のニューラルネットワークを再構成する、項目9に記載のクライアント端末。
[項目11]
 前記回路(circuitry)は、空間的な情報の密度が異なる複数の前記3Dシーン情報を、当該情報の密度が低い順に取得し、
 取得された前記3Dシーン情報の前記情報の密度に対応するように、前記フレームの解像度を変化させる、項目9に記載のクライアント端末。
[項目12]
 前記回路(circuitry)は、前記サーバから送信された表示画像のうち、前記表示用視点の変化により描画が必要と判定された領域を、前記3Dシーン情報を用いて描画する、項目9に記載のクライアント端末。
[項目13]
 ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、前記クライアント端末と前記コンテンツサーバは、以下のように構成された回路(circuitry configured to)を備え、
 前記コンテンツサーバが備える回路(circuitry)は、
 前記表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記クライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信し、
 前記クライアント端末が備える回路(circuitry)は、
 前記ユーザ操作の情報と、前記表示世界に対する表示用視点の情報とを取得し、
 前記複数種類の3Dシーン情報のデータを、前記コンテンツサーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 画像表示システム。
[項目14]
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する、
 表示用データ送信方法。
[項目15]
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得し、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 表示画像生成方法。
[項目16]
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する機能と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する機能と、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する機能と、
 をコンピュータに実現させるためのプログラムを記録した記録媒体。
[項目17]
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する機能と、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する機能と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する機能と、
 をコンピュータに実現させるためのプログラムを記録した記録媒体。
The present disclosure may include the following aspects.
[Item 1]
A content server, comprising:
The circuitry includes:
Generate learning images that represent scenes from multiple viewpoints of a three-dimensional display world in which the situation changes in response to user operations;
acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
Transmitting the plurality of types of 3D scene information data to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
Content server.
[Item 2]
The circuitry includes:
The content server according to item 1, wherein a plurality of pieces of 3D scene information are acquired, each having a different spatial information density.
[Item 3]
The circuitry includes:
A content server according to item 1, which divides a neural network constituting one of the 3D scene information into multiple neural networks from which some nodes that are different from each other are excluded, and transmits the multiple neural networks to the client terminal.
[Item 4]
The content server according to item 3, wherein the circuitry packetizes the neural network after division and randomly changes the transmission order of the packets.
[Item 5]
The content server of item 1, wherein the circuitry obtains a plurality of pieces of 3D scene information representing different areas in the display world.
[Item 6]
The content server of item 5, wherein the circuitry obtains the 3D scene information representing a range of the display world that corresponds to a defined area in an image displayed on the client terminal.
[Item 7]
The content server of item 5, wherein the circuitry obtains 3D scene information for objects present in the display world.
[Item 8]
The content server according to item 1, wherein the circuitry transmits data of the plurality of types of 3D scene information to the client terminal using different communication protocols.
[Item 9]
A client terminal, comprising:
The circuitry includes:
acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
Acquire from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Client terminal.
[Item 10]
The client terminal according to item 9, wherein the circuitry obtains a plurality of neural networks obtained by dividing a neural network that constitutes one of the 3D scene information, with some nodes that are different from each other being excluded, and reconstructs the neural network before division.
[Item 11]
The circuitry acquires a plurality of pieces of 3D scene information having different spatial information densities in ascending order of spatial information density;
The client terminal according to item 9, which varies the resolution of the frames to correspond to the information density of the acquired 3D scene information.
[Item 12]
The client terminal according to item 9, wherein the circuitry uses the 3D scene information to draw an area of the display image transmitted from the server that is determined to require drawing due to a change in the display viewpoint.
[Item 13]
A client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation, and a content server that transmits data used to generate the display image, the client terminal and the content server each comprising circuitry configured as follows:
The circuitry of the content server includes:
generating images representing each scene of the display world as seen from a plurality of viewpoints as learning images;
By machine learning using the learning images as training data, a plurality of types of 3D scene information representing three-dimensional information of each scene is obtained;
Transmitting the plurality of types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed;
The circuitry of the client terminal includes:
acquiring information on the user operation and information on a display viewpoint for the display world;
acquiring the plurality of types of 3D scene information data from the content server;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Image display system.
[Item 14]
Generate learning images that represent scenes from multiple viewpoints of a three-dimensional display world in which the situation changes in response to user operations;
By machine learning using the learning images as training data, a plurality of types of 3D scene information representing three-dimensional information of each scene is obtained;
Transmitting the plurality of types of 3D scene information data to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A method for transmitting data for display.
[Item 15]
Acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
Acquire from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Display image generation method.
[Item 16]
A function of generating images for learning that represent each scene in a three-dimensional display world, the situation of which changes in response to user operations, as seen from multiple viewpoints;
A function of acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
A function of transmitting the data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A recording medium on which a program for realizing the above on a computer is recorded.
[Item 17]
A function for acquiring information on user operations and information on a display viewpoint for a three-dimensional display world;
A function of acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A recording medium on which a program for implementing the above on a computer is recorded.

 1 画像表示システム、 10 クライアント端末、 14 入力装置、 16 表示装置、 20 コンテンツサーバ、 50 入力情報取得部、 52 3Dシーン情報取得部、 54 3Dシーン情報記憶部、 56 画像生成部、 58 出力部、 70 入力情報取得部、 72 学習用視点生成部、 74 表示世界制御部、 76 3次元モデル記憶部、 78 学習用画像生成部、 80 種類別3Dシーン情報取得部、 84 種類別3Dシーン情報記憶部、 86 3Dシーン情報送信部、 88 分割部、 122 CPU、 124 GPU、 126 メインメモリ。 1 Image display system, 10 Client terminal, 14 Input device, 16 Display device, 20 Content server, 50 Input information acquisition unit, 52 3D scene information acquisition unit, 54 3D scene information storage unit, 56 Image generation unit, 58 Output unit, 70 Input information acquisition unit, 72 Learning viewpoint generation unit, 74 Display world control unit, 76 3D model storage unit, 78 Learning image generation unit, 80 Type-specific 3D scene information acquisition unit, 84 Type-specific 3D scene information storage unit, 86 3D scene information transmission unit, 88 Splitting unit, 122 CPU, 124 GPU, 126 Main memory.

Claims (17)

 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、
 を備えたことを特徴とするコンテンツサーバ。
a learning image generating unit that generates, as learning images, images that represent how each scene in a three-dimensional displayed world, whose situation changes in response to a user operation, is viewed from a plurality of viewpoints;
a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as teacher data;
a 3D scene information transmission unit that transmits data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A content server comprising:
 前記種類別3Dシーン情報取得部は、空間的な情報の密度が異なる、複数の前記3Dシーン情報を取得することを特徴とする請求項1に記載のコンテンツサーバ。 The content server according to claim 1, characterized in that the type-specific 3D scene information acquisition unit acquires multiple pieces of 3D scene information with different spatial information densities.

 前記3Dシーン情報送信部は、1つの前記3Dシーン情報を構成するニューラルネットワークを、互いに異なる一部のノードが除外された複数のニューラルネットワークに分割したうえ、前記クライアント端末に送信することを特徴とする請求項1または2に記載のコンテンツサーバ。 The content server according to claim 1 or 2, characterized in that the 3D scene information transmission unit divides the neural network that constitutes one piece of 3D scene information into multiple neural networks from which some mutually different nodes are excluded, and transmits the divided neural networks to the client terminal.

 前記3Dシーン情報送信部は、分割後の前記ニューラルネットワークをパケット化するとともに、その送信順をランダムに入れ替えることを特徴とする請求項3に記載のコンテンツサーバ。 The content server according to claim 3, characterized in that the 3D scene information transmission unit packetizes the neural network after division and randomly changes the transmission order.

 前記種類別3Dシーン情報取得部は、前記表示世界における異なる範囲を表す、複数の前記3Dシーン情報を取得することを特徴とする請求項1または2に記載のコンテンツサーバ。 The content server according to claim 1 or 2, characterized in that the type-specific 3D scene information acquisition unit acquires multiple pieces of 3D scene information representing different ranges in the display world.

 前記種類別3Dシーン情報取得部は、前記クライアント端末において表示されている画像において定めた領域に対応する、前記表示世界の範囲を表す前記3Dシーン情報を取得することを特徴とする請求項5に記載のコンテンツサーバ。 The content server according to claim 5, characterized in that the type-specific 3D scene information acquisition unit acquires the 3D scene information that represents the range of the display world that corresponds to an area defined in the image displayed on the client terminal.
 前記種類別3Dシーン情報取得部は、前記表示世界に存在するオブジェクトを対象とする3Dシーン情報を取得することを特徴とする請求項5に記載のコンテンツサーバ。 The content server according to claim 5, characterized in that the type-specific 3D scene information acquisition unit acquires 3D scene information targeting objects present in the display world.

 前記3Dシーン情報送信部は、前記複数種類の3Dシーン情報のデータを、異なる通信プロトコルで前記クライアント端末に送信することを特徴とする請求項1に記載のコンテンツサーバ。 The content server according to claim 1, characterized in that the 3D scene information transmission unit transmits the data of the multiple types of 3D scene information to the client terminal using different communication protocols.

 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する入力情報取得部と、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する3Dシーン情報取得部と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、
 を備えたことを特徴とするクライアント端末。
an input information acquisition unit that acquires information on a user operation and information on a display viewpoint for a three-dimensional display world;
a 3D scene information acquisition unit that acquires from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
an image generator configured to render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A client terminal comprising:
 前記3Dシーン情報取得部は、1つの前記3Dシーン情報を構成するニューラルネットワークを分割してなる、互いに異なる一部のノードが除外された複数のニューラルネットワークを取得し、分割前のニューラルネットワークを再構成することを特徴とする請求項9に記載のクライアント端末。 The client terminal according to claim 9, characterized in that the 3D scene information acquisition unit acquires multiple neural networks obtained by dividing a neural network constituting one piece of 3D scene information, with some mutually different nodes removed, and reconstructs the neural network before division.

 前記3Dシーン情報取得部は、空間的な情報の密度が異なる複数の前記3Dシーン情報を、当該情報の密度が低い順に取得し、
 前記画像生成部は、取得された前記3Dシーン情報の前記情報の密度に対応するように、前記フレームの解像度を変化させることを特徴とする請求項9または10に記載のクライアント端末。
The 3D scene information acquisition unit acquires a plurality of pieces of 3D scene information having different spatial information densities in order of decreasing density of the information,
The client terminal according to claim 9, wherein the image generation unit changes the resolution of the frames so as to correspond to the information density of the acquired 3D scene information.
 前記画像生成部は、前記サーバから送信された表示画像のうち、前記表示用視点の変化により描画が必要と判定された領域を、前記3Dシーン情報を用いて描画することを特徴とする請求項9または10に記載のクライアント端末。 The client terminal according to claim 9 or 10, characterized in that the image generation unit uses the 3D scene information to render an area of the display image transmitted from the server that is determined to require rendering due to a change in the display viewpoint.

 ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、
 前記コンテンツサーバは、
 前記表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、
 前記クライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、
 を備え、
 前記クライアント端末は、
 前記ユーザ操作の情報と、前記表示世界に対する表示用視点の情報とを取得する入力情報取得部と、
 前記複数種類の3Dシーン情報のデータを、前記コンテンツサーバから取得する3Dシーン情報取得部と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、
 を備えたことを特徴とする画像表示システム。
An image display system comprising a client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation, and a content server that transmits data used to generate the display image,
The content server,
a learning image generating unit that generates images representing how each scene in the display world is viewed from a plurality of viewpoints as learning images;
a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as teacher data;
a 3D scene information transmission unit that transmits the plurality of types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed;
Equipped with
The client terminal includes:
an input information acquisition unit that acquires information on the user operation and information on a display viewpoint for the display world;
a 3D scene information acquisition unit that acquires the plurality of types of 3D scene information data from the content server;
an image generator configured to render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
An image display system comprising:
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成するステップと、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得するステップと、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信するステップと、
 を含むことを特徴とする表示用データ送信方法。
generating, as learning images, images representing how each scene in a three-dimensional displayed world, in which a situation changes in response to a user operation, is viewed from a plurality of viewpoints;
acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
transmitting data of the plurality of types of 3D scene information to a client terminal that renders a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A display data transmission method comprising:
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得するステップと、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得するステップと、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画するステップと、
 を含むことを特徴とする表示画像生成方法。
acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A display image generating method comprising:
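The client-side method can likewise be sketched. This is an illustrative sketch only; `ClientRenderer` and its methods are hypothetical names, not from the application. It shows the claimed pairing: each frame is drawn from the most recently received 3D scene information and the latest display viewpoint, with newer scene data superseding older data as it arrives from the server.

```python
class ClientRenderer:
    # Minimal sketch of a client terminal that always pairs the latest
    # display viewpoint with the most recently received 3D scene
    # information when drawing a frame.
    def __init__(self):
        self.latest_viewpoint = None
        self.latest_scene_info = None

    def on_user_input(self, viewpoint):
        # Input information acquisition: record the latest viewpoint.
        self.latest_viewpoint = viewpoint

    def on_scene_info(self, scene_info):
        # 3D scene information acquisition: newer data replaces older.
        self.latest_scene_info = scene_info

    def render_frame(self):
        if self.latest_scene_info is None or self.latest_viewpoint is None:
            return None  # nothing to draw yet
        return f"frame({self.latest_scene_info}@{self.latest_viewpoint})"

r = ClientRenderer()
r.on_scene_info("scene-v1")
r.on_user_input((0.0, 1.0, 2.0))
r.on_scene_info("scene-v2")      # later data supersedes scene-v1
print(r.render_frame())          # → frame(scene-v2@(0.0, 1.0, 2.0))
```

Keeping only the most recent scene data and viewpoint lets the client keep rendering smoothly even while the server is still streaming scene information for other parts of the display world.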
A function of generating images for learning that represent each scene in a three-dimensional display world, the situation of which changes in response to user operations, as seen from multiple viewpoints;
A function of acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
A function of transmitting the data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A computer program characterized by causing a computer to execute the above.
A function for acquiring information on user operations and information on a display viewpoint for a three-dimensional display world;
A function of acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
A function of rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A computer program characterized by causing a computer to execute the above.
PCT/JP2023/043347 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method Pending WO2025120714A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/043347 WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/043347 WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Publications (1)

Publication Number Publication Date
WO2025120714A1 2025-06-12

Family

ID=95979643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/043347 Pending WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Country Status (1)

Country Link
WO (1) WO2025120714A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160300385A1 (en) * 2014-03-14 2016-10-13 Matterport, Inc. Processing and/or transmitting 3d data
JP2020005201A (en) * 2018-06-29 2020-01-09 日本放送協会 Transmitting device and receiving device
US20210035352A1 (en) * 2018-03-20 2021-02-04 Pcms Holdings, Inc. System and method for dynamically adjusting level of details of point clouds
CN114004941A (en) * 2022-01-04 2022-02-01 苏州浪潮智能科技有限公司 Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
JP2023543538A (en) * 2020-07-31 2023-10-17 グーグル エルエルシー Robust viewpoint synthesis for unconstrained image data


Similar Documents

Publication Publication Date Title
EP3760287B1 (en) Method and device for generating video frames
JP7531568B2 (en) Eye tracking with prediction and latest updates to the GPU for fast foveated rendering in HMD environments
JP7391939B2 (en) Prediction and throttle adjustment based on application rendering performance
JP7164630B2 (en) Dynamic Graphics Rendering Based on Predicted Saccade Landing Points
US10937220B2 (en) Animation streaming for media interaction
US10089790B2 (en) Predictive virtual reality display system with post rendering correction
US9832451B2 (en) Methods for reduced-bandwidth wireless 3D video transmission
JP6298432B2 (en) Image generation apparatus, image generation method, and image generation program
US8909506B2 (en) Program, information storage medium, information processing system, and information processing method for controlling a movement of an object placed in a virtual space
CN111522433B (en) Method and system for determining current gaze direction
CN110832442A (en) Optimized shadows and adaptive mesh skinning in foveated rendering systems
US11107183B2 (en) Adaptive mesh skinning in a foveated rendering system
KR100623173B1 (en) Game character animation implementation system, implementation method and production method
WO2025120714A1 (en) Content server, client terminal, image display system, display data transmission method, and display image generation method
JP2003305275A (en) Game program
US20240282034A1 (en) Apparatus, systems and methods for animation data
KR102192153B1 (en) Method and program for providing virtual reality image
US12422978B2 (en) User interaction management to reduce lag in user-interactive applications
KR102179810B1 (en) Method and program for playing virtual reality image
WO2025094514A1 (en) Content processing device and content processing method
WO2025094266A1 (en) Content server, client terminal, image display system, draw data transmission method, and display image generation method
JP2004005182A (en) Visualization method
WO2024089725A1 (en) Image processing device and image processing method
WO2025094265A1 (en) Content server and content processing method
WO2025094264A1 (en) Image processing device and image processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23960729

Country of ref document: EP

Kind code of ref document: A1