
WO2025120714A1 - Content server, client terminal, image display system, display data transmission method, and display image generation method - Google Patents


Info

Publication number
WO2025120714A1
Authority
WO
WIPO (PCT)
Prior art keywords
scene
scene information
display
information
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/JP2023/043347
Other languages
French (fr)
Japanese (ja)
Inventor
新宇 張
徳秀 金子
和之 有松
Current Assignee
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc filed Critical Sony Interactive Entertainment Inc
Priority to PCT/JP2023/043347
Publication of WO2025120714A1
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 Controlling the output signals based on the game progress
    • A63F13/52 Controlling the output signals based on the game progress involving aspects of the displayed game scene
    • A63F13/525 Changing parameters of virtual cameras
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/131 Protocols for games, networked simulations or virtual reality

Definitions

  • This invention relates to a content server, a client terminal, an image display system, a display data transmission method, and a display image generation method that display images of a three-dimensional display world.
  • electronic content in which video images are generated in real time in response to user operations and distributed from a server can utilize the server's abundant processing environment, making it easier to display high-quality images while minimizing the impact of the client terminal's processing performance.
  • however, the constant transmission of operation information from the client terminal, and the data transmission processing by the server that receives it, can cause the display to lag behind viewpoint operations or frames to be lost through packet loss, degrading the quality of the user experience.
  • the present invention was made in consideration of these problems, and its purpose is to provide a technology that reduces the impact of distribution on the quality of the user experience when processing images of content that is distributed from a server.
  • This content server is characterized by having a learning image generation unit that generates learning images that represent the state of each scene in a three-dimensional displayed world, where the situation changes in response to user operations, as viewed from multiple viewpoints; a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and a 3D scene information transmission unit that transmits data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.
  • This client terminal is characterized by having an input information acquisition unit that acquires information on user operations and information on a display viewpoint for a three-dimensional display world, a 3D scene information acquisition unit that acquires from a server data on multiple types of 3D scene information that represent three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to user operations, and an image generation unit that draws at least a portion of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • This image display system includes a client terminal that displays an image of a three-dimensional display world in which the situation changes according to user operations, and a content server that transmits data used to generate the display image.
  • the content server includes a learning image generation unit that generates images showing how each scene in the display world looks from multiple viewpoints as learning images, a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene by machine learning using the learning images as teacher data, and a 3D scene information transmission unit that transmits the multiple types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed.
  • the client terminal includes an input information acquisition unit that acquires information on user operations and information on a display viewpoint for the display world, a 3D scene information acquisition unit that acquires data on the multiple types of 3D scene information from the content server, and an image generation unit that draws at least a part of the frame of the display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • This display data transmission method is characterized by including the steps of: generating, as learning images, images that represent how each scene in a three-dimensional display world, in which the situation changes in response to user operations, is viewed from multiple viewpoints; acquiring multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and transmitting data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.
  • Another aspect of the present invention relates to a display image generating method, which is characterized by including the steps of: acquiring information on user operations and information on a display viewpoint for a three-dimensional display world; acquiring from a server data on multiple types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to the user operations; and drawing at least a part of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.
  • the impact of distribution on the quality of the user experience can be reduced.
  • FIG. 1 is a diagram showing a configuration example of an image display system to which the present embodiment can be applied;
  • FIG. 2 is a diagram illustrating an internal circuit configuration of a client terminal according to the present embodiment.
  • FIG. 3 is a diagram showing a basic flow of image processing according to the present embodiment in comparison with the prior art.
  • FIG. 4 is a diagram showing a configuration of functional blocks of a client terminal and a content server according to the present embodiment.
  • FIG. 5 is a diagram illustrating a procedure in which a content server acquires 3D scene information in the present embodiment.
  • FIG. 6 is a diagram for explaining how 3D scene information of different ranges is acquired in the present embodiment.
  • FIG. 7 is a schematic diagram of data transition during transmission and reception of 3D scene information in the present embodiment.
  • FIG. 8 is a diagram for explaining an image correction process performed by an image generation unit in the present embodiment.
  • FIG. 1 shows an example of the configuration of an image display system to which this embodiment can be applied.
  • the image display system 1 includes client terminals 10a, 10b, 10c that display images in response to user operations, and a content server 20 that provides data used for display.
  • Input devices 14a, 14b, 14c for user operations and display devices 16a, 16b, 16c that display images are connected to the client terminals 10a, 10b, 10c, respectively.
  • the client terminals 10a, 10b, 10c and the content server 20 can establish communication via a network 8 such as a WAN (Wide Area Network) or a LAN (Local Area Network).
  • the client terminals 10a, 10b, 10c may be connected to the display devices 16a, 16b, 16c and the input devices 14a, 14b, 14c either by wire or wirelessly. Alternatively, two or more of these devices may be formed integrally.
  • the client terminal 10b is connected to a head-mounted display, which is the display device 16b.
  • the head-mounted display can change the field of view of the displayed image according to the movement of the user wearing it on the head, so it also functions as the input device 14b.
  • the client terminal 10c is a portable terminal that is integrated with the display device 16c and the input device 14c, which is a touchpad that covers the screen of the display device 16c. In this way, there are no limitations on the external shape or connection form of the illustrated devices. There are also no limitations on the number of client terminals 10a, 10b, 10c and content servers 20 that are connected to the network 8. Hereinafter, the client terminals 10a, 10b, 10c will be collectively referred to as client terminals 10, the input devices 14a, 14b, 14c as input device 14, and the display devices 16a, 16b, 16c as display device 16.
  • the input device 14 may be any one or a combination of general input devices such as a controller, keyboard, mouse, touchpad, joystick, or various sensors such as a motion sensor or camera equipped in a head mounted display, and supplies the contents of user operations to the client terminal 10.
  • the display device 16 may be any general display such as a liquid crystal display, plasma display, organic EL display, wearable display, or projector, and displays images output from the client terminal 10.
  • the content server 20 provides data of content accompanied by image display to the client terminal 10.
  • the type of content is not particularly limited, and may be any of electronic games, decorative images, web pages, video chat using avatars, etc.
  • the content server 20 sequentially obtains information on user operations on the input device 14 from the client terminal 10, reflects this information in the world to be displayed, and transmits the necessary data so that an image representing this is displayed on the client terminal 10 side.
  • FIG. 2 shows the internal circuit configuration of the client terminal 10.
  • the client terminal 10 includes a CPU (Central Processing Unit) 122, a GPU (Graphics Processing Unit) 124, and a main memory 126. These components are interconnected via a bus 130. An input/output interface 128 is also connected to the bus 130.
  • Connected to the input/output interface 128 are: a communication unit 132 consisting of a peripheral device interface such as USB or a network interface for a wired or wireless LAN; a storage unit 134 such as a hard disk drive or non-volatile memory; an output unit 136 that outputs data to the display device 16; an input unit 138 that inputs data from the input device 14; and a recording medium drive unit 140 that drives a removable recording medium such as a magnetic disk, optical disk, or semiconductor memory.
  • the CPU 122 executes an operating system stored in the storage unit 134 to control the entire client terminal 10.
  • the CPU 122 also executes various programs that have been read from a removable recording medium and loaded into the main memory 126, or downloaded via the communication unit 132.
  • the GPU 124 performs drawing processing in accordance with drawing commands from the CPU 122, and stores the display image in a frame buffer (not shown).
  • the GPU 124 then converts the display image stored in the frame buffer into a video signal and outputs it to the output unit 136.
  • the main memory 126 is composed of RAM (Random Access Memory), and stores programs and data necessary for processing.
  • the content server 20 may also have a similar internal circuit configuration.
  • Figure 3 shows the basic flow of image processing in this embodiment in comparison with the prior art.
  • the main display target is a three-dimensional world in which various objects exist.
  • the state of the world changes according to the provisions of the program or the user's operation.
  • the content server constantly acquires information on the content of the user's operation, the position of the viewpoint relative to the displayed world, and the direction of the line of sight.
  • the entire three-dimensional space to be displayed is called the "display world".
  • the state of the displayed world within or near the display field of view is called the "scene".
  • the position of the viewpoint and the direction of the line of sight relative to the scene may also be collectively referred to simply as the "viewpoint".
  • the viewpoint may be manually operated by the user via the input device 14, or may be derived from the movement of the user's head using a motion sensor provided in the head-mounted display.
  • the content server draws the display image 200 in a field of view corresponding to the viewpoint information while changing the scene in response to user operations.
  • the content server generates the display image 200 using well-known computer graphics drawing techniques such as ray tracing and rasterization.
  • the content server transmits the generated display image 200 to the client terminal, which displays it on a display device. By repeating the illustrated process at a specified frame rate, a moving image showing the change in the scene in response to user operations, etc. is displayed on the client terminal side.
  • the content server 20 also acquires information on the content of user operations, the position of the viewpoint relative to the displayed world, and the direction of the gaze as needed, and similarly renders images while changing the scene accordingly. In this embodiment, however, the content server 20 uses the rendered images as learning images 202, that is, as training data for machine learning. The content server 20 collects the learning images 202 and performs machine learning to generate 3D scene information 204 that represents three-dimensional information about the scene.
  • As a representative example of such machine learning, NeRF (Neural Radiance Fields) trains a multilayer perceptron (MLP) as the 3D scene information. This data is a neural network that takes as input five-dimensional parameters consisting of position coordinates (x, y, z) and a direction vector d(θ, φ) in three-dimensional space, and outputs a volume density σ and color information c (RGB) of the three primary colors.
  • the data of this neural network is called "3D scene information.”
  • various improvement methods have been proposed for NeRF, and the specific method is not particularly limited in this embodiment. Furthermore, it is not intended to limit the machine learning method to NeRF.
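For illustration, the following Python/NumPy sketch (an assumption for explanatory purposes, not the patent's implementation; `scene_mlp` and the layer sizes are hypothetical) shows the shape of such a network: a five-dimensional input (x, y, z, θ, φ) mapped to a volume density σ and a color c:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny MLP: 5-D input (x, y, z, theta, phi) -> (sigma, r, g, b).
W1 = rng.normal(0.0, 0.5, (5, 32))
b1 = np.zeros(32)
W2 = rng.normal(0.0, 0.5, (32, 4))
b2 = np.zeros(4)

def scene_mlp(xyz, direction):
    """Evaluate the radiance field at one point and view direction."""
    p = np.concatenate([xyz, direction])   # 5-D input vector
    h = np.maximum(W1.T @ p + b1, 0.0)     # hidden layer, ReLU
    out = W2.T @ h + b2
    sigma = np.log1p(np.exp(out[0]))       # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[1:]))   # sigmoid keeps color in [0, 1]
    return sigma, rgb

sigma, rgb = scene_mlp(np.array([0.1, 0.2, 0.3]), np.array([0.0, 1.0]))
```

A real NeRF uses many more layers and a positional encoding of the input, but the input/output contract is the one described above.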
  • the content represented by the learning images 202, and therefore the 3D scene information 204, can change from moment to moment.
  • the figure shows the situation in which 3D scene information 204 is generated for a scene at a certain time, or within an infinitesimally short period that can be regarded as a single time.
  • hereinafter, such a scene at a certain time, or within an infinitesimally short period that can be regarded as a single time, may be expressed as "one scene". Note that if there is no movement in the displayed world, it can be treated as "one scene" regardless of time.
  • the content server 20 updates the 3D scene information 204 to correspond to the next scene.
  • the generation and updating of 3D scene information may be collectively referred to as "obtaining" 3D scene information.
  • the content server 20 may collect the learning images 202, for example, in the following manner. (1) Generate viewpoints suitable for learning around the viewpoint that defines the field of view of the image that is actually displayed, and generate corresponding images. (2) Reuse images displayed on the devices of multiple users viewing the same scene.
  • the viewpoint that defines the actual display field of view in the client terminal 10 is called the "display viewpoint” to be distinguished from the "learning viewpoint” that is set when generating a learning image.
  • the content server 20 may implement only one of (1) and (2), or may implement both. For example, a viewpoint that is missing due to (2) may be supplemented by (1).
  • the content server 20 transmits 3D scene information 204, which is the result of the learning, to the client terminal 10.
  • the client terminal 10 generates a display image 206 using the transmitted 3D scene information 204.
  • the client terminal 10 can display the scene as it is viewed from any viewpoint with high quality, with a relatively low load.
  • when NeRF is applied, the client terminal 10 generates a ray r that passes through each pixel on the view screen from the display viewpoint, and uses volume rendering to integrate the color along that direction to determine the pixel value C(r) of the display image as follows:

    C(r) = ∫[tn→tf] T(t) σ(r(t)) c(r(t), d) dt,  where  T(t) = exp( −∫[tn→t] σ(r(s)) ds )
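In practice, the volume rendering used here is evaluated by sampling points along each ray. The following NumPy sketch (sample values are made up for illustration) computes the standard discrete approximation C(r) ≈ Σ T_i (1 − exp(−σ_i δ_i)) c_i:

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray.
    sigmas: (N,) densities; colors: (N, 3) RGB; deltas: (N,) segment lengths."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # opacity per segment
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance T_i
    weights = trans * alphas
    pixel = (weights[:, None] * colors).sum(axis=0)
    return pixel, weights

# Example: an almost opaque red sample near the camera hides what is behind it.
sigmas = np.array([50.0, 1.0, 1.0])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
deltas = np.array([0.5, 0.5, 0.5])
pixel, weights = render_ray(sigmas, colors, deltas)
```

In the example, the first sample is nearly opaque, so its color dominates the pixel and occludes the samples behind it.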
  • the content server 20 repeats the process shown in the figure for the movement of the scene, updating the 3D scene information 204 at a predetermined rate and sequentially transmitting it to the client terminal 10.
  • the client terminal 10 generates a display image 206 while updating the 3D scene information it uses, thereby making it possible to represent moving images from any viewpoint that have the same changes as the learning images 202 generated by the content server 20.
  • the client terminal 10 can display an image with little delay in response to changes in viewpoint by drawing a display image 206 that shows the scene as seen from the viewpoint immediately before display using the latest 3D scene information 204.
  • FIG. 4 shows the functional block configuration of the client terminal 10 and the content server 20 in this embodiment.
  • the functions of the components in the illustrated functional blocks may be realized in circuitry or processing circuitry including general purpose processors, application specific processors, integrated circuits, ASICs (Application Specific Integrated Circuits), a CPU (Central Processing Unit), conventional circuits, and/or combinations thereof, configured or programmed to realize the functions described herein.
  • a processor is considered to be a circuitry or processing circuitry including transistors and other circuits.
  • a processor may be a programmed processor that executes programs stored in memory.
  • a circuit, unit, or means is hardware that is programmed to realize, or that executes, a described function.
  • the hardware may be any hardware disclosed in this specification or any hardware that is programmed to realize or known to execute the described function. If the hardware is a processor, which is considered to be a type of circuit, the circuit, means, or unit is a combination of hardware and software used to configure the hardware and/or processor.
  • the client terminal 10 includes an input information acquisition unit 50 that acquires input information such as user operations, a 3D scene information acquisition unit 52 that acquires 3D scene information data from the content server 20, a 3D scene information storage unit 54 that stores the acquired 3D scene information data, an image generation unit 56 that generates a display image, and an output unit 58 that outputs the display image data.
  • the input information acquisition unit 50 acquires the contents of user operations from the input device 14 at any time. User operations include the selection and activation of content, and command input for content currently being executed.
  • the input information acquisition unit 50 also acquires information on the display viewpoint for the displayed world from the input device 14 or head-mounted display at any time or at a specified time interval.
  • Information for detecting the position and posture of the head of a user wearing a head-mounted display and acquiring viewpoint information based on this is well known, and this may also be applied in this embodiment.
  • the input information acquisition unit 50 supplies the acquired information to the content server 20 and the image generation unit 56 as appropriate.
  • the 3D scene information acquisition unit 52 sequentially acquires data of 3D scene information, which is continuously updated, from the content server 20.
  • the 3D scene information acquisition unit 52 acquires multiple types of 3D scene information representing one scene from the content server 20.
  • the 3D scene information storage unit 54 stores the data of the multiple types of 3D scene information acquired by the 3D scene information acquisition unit 52.
  • the 3D scene information acquisition unit 52 acquires new 3D scene information, it updates the data of the same type of 3D scene information stored in the 3D scene information storage unit 54.
  • the image generation unit 56 uses the 3D scene information data most recently stored in the 3D scene information storage unit 54 to draw a display image at a predetermined frame rate.
  • the image generation unit 56 acquires the latest display viewpoint from the input information acquisition unit 50, and draws an image in the corresponding field of view using a technique such as the volume rendering described above. If 3D scene information corresponding to the transition of the scene can be prepared using machine learning, then even if the client terminal 10 generates the display image, it is possible to draw a high-quality image with a lighter load than with normal processing such as ray tracing.
  • the image generation unit 56 changes the display image representing one scene in time, space, or both by drawing using multiple types of 3D scene information. For example, in a mode in which the content server 20 transmits 3D scene information data in ascending order of information density, the image generation unit 56 updates the display image using the transmitted 3D scene information in sequence. This allows an image representing one scene to be displayed with low latency, and the visual image quality can be maintained by gradually increasing the resolution.
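A minimal sketch of this client-side bookkeeping (hypothetical class and method names; the patent does not prescribe an implementation) keeps only the most recent data of each type and draws with the highest-density information available for the newest scene:

```python
# Hypothetical client-side store for multiple types of 3D scene information,
# where a larger type id means higher information density (and later arrival).
class SceneInfoStore:
    def __init__(self):
        self.latest = {}   # type id -> (scene_time, payload)

    def update(self, info_type, scene_time, payload):
        """Replace stored data of the same type with newer data (cf. unit 54)."""
        cur = self.latest.get(info_type)
        if cur is None or scene_time >= cur[0]:
            self.latest[info_type] = (scene_time, payload)

    def best_for_drawing(self):
        """Among entries for the newest scene, pick the highest-density one."""
        if not self.latest:
            return None
        newest = max(t for t, _ in self.latest.values())
        candidates = {k: v for k, v in self.latest.items() if v[0] == newest}
        return candidates[max(candidates)]

store = SceneInfoStore()
store.update(0, scene_time=1.0, payload="low-density net, scene t=1")
store.update(1, scene_time=1.0, payload="high-density net, scene t=1")
store.update(0, scene_time=2.0, payload="low-density net, scene t=2")
```

With these updates, drawing uses the low-density network for the newest scene until its higher-density counterpart arrives, matching the progressive-quality behavior described above.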
  • the output unit 58 outputs the image drawn by the image generation unit 56 to the display device 16 at a predetermined rate for display.
  • the content server 20 includes an input information acquisition unit 70 that acquires input information from the client terminal 10, a learning viewpoint generation unit 72 that generates a learning viewpoint, a display world control unit 74 that controls the display world, a 3D model storage unit 76 that stores 3D models of objects, a learning image generation unit 78 that generates learning images, a type-specific 3D scene information acquisition unit 80 that acquires data on multiple types of 3D scene information, a 3D scene information storage unit 84 that stores the acquired data on the 3D scene information, and a 3D scene information transmission unit 86 that transmits the data on the 3D scene information to the client terminal 10.
  • the input information acquisition unit 70 acquires the contents of user operations and viewpoint information from the client terminal 10 at any time or at a specified time interval.
  • the learning viewpoint generation unit 72 generates multiple learning viewpoints for generating learning images.
  • the learning viewpoint generation unit 72 generates learning viewpoints around the latest display viewpoint acquired by the input information acquisition unit 70 according to a specified rule. For example, the learning viewpoint generation unit 72 places a specified number of learning viewpoints evenly inside a sphere of a specified radius centered on the latest display viewpoint.
  • the specified radius is, for example, a value obtained by multiplying the maximum expected speed of viewpoint movement by the longest time until the learned data is used for display.
  • the learning viewpoint generating unit 72 is not limited to distributing the learning viewpoints evenly, but may distribute more learning viewpoints in a range in which the viewpoint is expected to move, depending on the situation in the displayed world, etc.
  • the learning viewpoint generating unit 72 may also set lines of sight evenly in a predetermined number of directions for the viewpoint at each position, or may set more lines of sight in directions to which the line of sight is expected to move.
  • the learning viewpoint generating unit 72 may also use the latest display viewpoint itself as a learning viewpoint.
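The viewpoint placement described above can be sketched as follows (a Python/NumPy illustration under the stated rule; function and parameter names are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_learning_viewpoints(center, v_max, t_max, n):
    """Place n learning viewpoints uniformly inside a sphere around the
    latest display viewpoint. Radius = maximum expected viewpoint speed
    multiplied by the longest time until the learned data is displayed."""
    radius = v_max * t_max
    points = []
    while len(points) < n:                        # rejection sampling in a cube
        cand = rng.uniform(-radius, radius, 3)
        if np.linalg.norm(cand) <= radius:
            points.append(center + cand)
    return np.array(points), radius

center = np.array([0.0, 1.5, -2.0])
viewpoints, radius = sample_learning_viewpoints(center, v_max=2.0, t_max=0.25, n=16)
```

A non-uniform variant would simply bias the sampling distribution toward the range in which the viewpoint is expected to move.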
  • the display world control unit 74 controls the three-dimensional display world represented as content according to the content of the user operations acquired by the input information acquisition unit 70. For example, if the content is an electronic game, the display world control unit 74 places necessary objects such as user characters in the virtual space where the electronic game takes place, and gives them movement according to commands entered by the user and program specifications.
  • the three-dimensional model storage unit 76 stores three-dimensional models of objects that exist in the display world, and the display world control unit 74 reads them out as appropriate to use in constructing the display world.
  • the learning image generation unit 78 generates images of the scene as viewed from multiple learning viewpoints generated by the learning viewpoint generation unit 72 as learning images.
  • the learning image generation unit 78 preferably generates learning images using a technique capable of rendering high-quality images, such as ray tracing.
  • the type-specific 3D scene information acquisition unit 80 uses the learning images generated by the learning image generation unit 78 to generate 3D scene information through machine learning as described above, and updates the information to correspond to changes in the scene.
  • the type-specific 3D scene information acquisition unit 80 acquires multiple types of 3D scene information representing one scene.
  • the type-specific 3D scene information acquisition unit 80 acquires multiple pieces of 3D scene information in which the spatial density of the represented information differs from one another.
  • the spatial density of the information held by the 3D scene information is referred to as "information density.”
  • Information density may also be referred to as the resolution of the information held by the 3D scene information, or the spatial frequency of the information.
  • the vectors input to a neural network during training are converted into vectors in a high-dimensional space that includes high frequencies using Positional Encoding, which allows the high-frequency components of the output vector to be represented more accurately.
  • the function γ used for the conversion is expressed as follows, where the parameter L determines the highest frequency and γ is applied to each component p of the input vector:

    γ(p) = ( sin(2^0·πp), cos(2^0·πp), sin(2^1·πp), cos(2^1·πp), …, sin(2^(L−1)·πp), cos(2^(L−1)·πp) )
  • the type-specific 3D scene information acquisition unit 80 acquires multiple pieces of 3D scene information with different information densities in parallel by setting multiple parameters L and performing machine learning individually.
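Assuming the standard NeRF positional encoding, a short Python sketch shows how the parameter L controls the information density of the encoded input (function name is hypothetical):

```python
import numpy as np

def positional_encoding(p, L):
    """Map each scalar component of p to 2L sinusoids; larger L lets the
    network represent higher-frequency (denser) detail."""
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(L)) * np.pi     # pi, 2*pi, 4*pi, ...
    args = p[..., None] * freqs               # shape (..., L)
    enc = np.stack([np.sin(args), np.cos(args)], axis=-1)
    return enc.reshape(-1)                    # flat vector of length dim * 2L

x = np.array([0.1, 0.2, 0.3])
low = positional_encoding(x, L=4)    # input for a coarse (low-density) network
high = positional_encoding(x, L=10)  # input for a fine (high-density) network
```

Training one network per value of L, as the text describes, yields multiple pieces of 3D scene information whose learning completes sooner for smaller L.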
  • the means for controlling the information density of the 3D scene information is not limited to this.
  • another example is Multiresolution Hash Encoding, which sets grids with multiple resolutions and expresses an input vector based on its positional relationship with their vertices (see, for example, Thomas Muller et al., "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding," ACM Transactions on Graphics, July 2022, Vol. 41, No. 4, Article 102, pp. 1-15).
  • the type-specific 3D scene information acquisition unit 80 can acquire multiple pieces of 3D scene information with different information densities in parallel by setting multiple levels L corresponding to the number of grid resolutions and performing machine learning individually. In these cases, multiple values to be set as the parameter L are prepared in the internal memory of the type-specific 3D scene information acquisition unit 80.
  • the type-specific 3D scene information acquisition unit 80 then sequentially uses multiple learning images generated for one scene by the learning image generation unit 78 to acquire multiple pieces of 3D scene information with different information densities in parallel, and stores them in the 3D scene information storage unit 84 or updates the same type of 3D scene information that has already been stored.
  • the type-specific 3D scene information acquisition unit 80 may set the level number L to only one and learn, and when transmitting to the client terminal 10, etc., select grids of different levels (e.g., grids of levels 0 to 1, grids of levels 0 to 2, ..., grids of levels 0 to L-1, etc.) individually to read out data. In this case, too, the learning is completed more quickly for lower level grids, i.e., grids with lower information density, and the effects described below are the same.
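This level-wise readout can be illustrated as follows (a toy sketch with made-up grid shapes; the actual table layout of Multiresolution Hash Encoding is more involved):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-level parameter tables: finer levels hold more parameters
# and, as the text notes, finish learning later.
grids = {level: rng.normal(size=(2 ** (4 + level), 2)) for level in range(4)}

def read_out_levels(grids, max_level):
    """Select grids of levels 0..max_level for transmission
    (levels 0-1, levels 0-2, ..., levels 0-(L-1))."""
    return {lv: grids[lv] for lv in range(max_level + 1)}

coarse = read_out_levels(grids, 1)   # low information density, ready first
fine = read_out_levels(grids, 3)     # full information density
```

Because each readout reuses the lower levels, the coarse selection is strictly smaller than the fine one, which is what makes it cheap to transmit first.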
  • the types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 are not limited to distinctions in information density.
  • the type-specific 3D scene information acquisition unit 80 may acquire multiple pieces of 3D scene information with different areas or objects to be learned in the display world. In this case, the smaller the range to be learned in the display world, the faster the learning is completed, and therefore the faster the 3D scene information is updated.
  • the type-specific 3D scene information acquisition unit 80 records the time of the reflected scene in association with each of the multiple types of 3D scene information stored in the 3D scene information storage unit 84. Note that the type-specific 3D scene information acquisition unit 80 may acquire multiple types of 3D scene information with different combinations of information density and range of the learning target.
  • the 3D scene information transmission unit 86 transmits the latest 3D scene information data stored in the 3D scene information storage unit 84 to the client terminal 10.
  • the 3D scene information transmission unit 86 transmits to the client terminal 10 3D scene information in order of completion of learning for one scene. For example, when data of 3D scene information with different information densities is to be transmitted, learning is completed in order from the 3D scene information with the lowest information density, as described above. Therefore, the 3D scene information transmission unit 86 transmits data of the 3D scene information with the lowest information density when learning of that information is completed, and transmits data of the 3D scene information with the next lowest information density when learning of that information is completed, and so on, gradually transmitting up to the 3D scene information with the highest information density.
  • the 3D scene information transmission unit 86 includes a division unit 88 inside.
  • the division unit 88 divides each of the multiple types of 3D scene information representing one scene into multiple data.
  • Each type of 3D scene information representing one scene is composed of an individual neural network.
  • the division unit 88 divides each neural network into multiple neural networks.
  • Here, division means generating multiple networks in which mutually different subsets of nodes are removed, while maintaining the structure of the remaining nodes associated by a hash table or the like. The nodes to be removed from each divided neural network are determined randomly.
  • To alleviate overfitting, a technique called dropout is known, in which some of the nodes in a neural network are randomly deactivated (see, for example, Nitish Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, June 2014, Vol. 15, pp. 1929-1958). As dropout demonstrates, the accuracy of a neural network's learning results can be maintained at a certain level even if some of its nodes are deactivated.
  • the client terminal 10 can generate a display image with relatively high accuracy even when using 3D scene information with some of the nodes removed. Therefore, the content server 20 can transmit multiple neural networks representing the same 3D scene information with some of the nodes removed, thereby suppressing an increase in data size and increasing robustness against packet loss. If there is no packet loss, the image generation unit 56 of the client terminal 10 can restore the neural network before division and draw the display image. If there is packet loss, the image generation unit 56 draws the display image using the neural network that it has acquired. As described above, it is possible to generate a display image in this case as well.
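A minimal sketch of this division, assuming each divided copy zeroes out a complementary subset of hidden-node weights (the layer sizes and split ratio are illustrative, not the patent's implementation):

```python
import numpy as np

def split_network(weights, rng):
    """Split one layer's node weights into two copies with mutually
    different nodes removed (zeroed). Summing the copies restores the
    original exactly; either copy alone is a degraded but usable network."""
    n_nodes = weights.shape[0]
    keep_a = rng.random(n_nodes) < 0.5          # random node assignment
    part_a = np.where(keep_a[:, None], weights, 0.0)
    part_b = np.where(~keep_a[:, None], weights, 0.0)
    return part_a, part_b

rng = np.random.default_rng(42)
w = rng.standard_normal((6, 4))   # 6 hidden nodes, 4 weights each
a, b = split_network(w, rng)
restored = a + b                  # no packet loss: exact reconstruction
```

If one of the two copies is lost in transit, the receiver can still evaluate the surviving copy, which is the robustness property the text attributes to dropout.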
  • the image generation unit 56 of the client terminal 10 updates the display image using the newly acquired 3D scene information. Since the resolution of the display image corresponds to the information density of the 3D scene information, the resolution of the display image gradually increases by acquiring 3D scene information in order starting with the 3D scene information with the lowest information density.
  • 3D scene information with low information density has a smaller number of samples in volume rendering, and therefore the drawing speed of the display image increases.
  • the time from the start of learning each scene to its display can be significantly shortened.
  • high-definition images can be displayed using high-information-density 3D scene information, making it possible to achieve low-latency display while minimizing the impact on appearance.
  • the types of 3D scene information are not limited to distinctions in resolution.
  • the content server 20 may learn only the partial area that the user is gazing at in a short time and transmit it first, and then transmit the 3D scene information for the entire scene later. In this case as well, a similar principle can be used to display an image in which the area being gazed at is displayed with low latency and updated to track the surrounding image.
  • the 3D scene information transmitting unit 86 may transmit 3D scene information representing one scene to the client terminal 10 using different communication protocols depending on the type of 3D scene information.
  • the 3D scene information transmitting unit 86 may transmit 3D scene information for an area where image detail should be prioritized using the highly reliable TCP/IP (Transmission Control Protocol/Internet Protocol), and transmit 3D scene information for an area where low latency in movement should be prioritized using UDP (User Datagram Protocol), which has a high transfer rate.
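The protocol selection described above can be sketched as a simple dispatch; the region names, the `priority` field, and the two-way split are assumptions for illustration:

```python
# Hypothetical mapping from scene-information regions to transports,
# following the priority rule described above.
def choose_transport(region):
    """Detail-priority regions go over reliable TCP; latency-priority
    regions go over UDP, trading reliability for transfer rate."""
    return "tcp" if region["priority"] == "detail" else "udp"

regions = [
    {"name": "gaze_area", "priority": "detail"},
    {"name": "moving_background", "priority": "latency"},
]
plan = {r["name"]: choose_transport(r) for r in regions}
```

In practice the choice could also depend on measured loss rates or the division redundancy described earlier, since divided networks already tolerate some packet loss.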
  • FIG. 5 shows a schematic diagram of the procedure by which the content server 20 acquires 3D scene information.
  • the display world control unit 74 constructs a display world 210 in which, for example, an enemy character 212 exists.
  • the display world 210 may be in motion, but here, the display world 210 corresponding to one scene is shown.
  • the learning viewpoint generation unit 72 generates multiple learning viewpoints based on the latest display viewpoint, etc.
  • the learning image generation unit 78 generates learning images 214a, 214b, 214c, etc. corresponding to each learning viewpoint. Note that there is no limit to the number of learning images to be generated.
  • the type-specific 3D scene information acquisition unit 80 performs machine learning using the learning images 214a, 214b, 214c, etc., and updates multiple types of 3D scene information.
  • the substance of each piece of 3D scene information is a neural network 216a, 216b, 216c, ....
  • the neural networks 216a, 216b, 216c, ... differ in at least one of the information density and the range they represent.
  • the type-specific 3D scene information acquisition unit 80 performs learning, for example, by setting the above-mentioned parameter L for each. In this case, the lower the information density of the 3D scene information, the quicker learning is completed.
  • the type-specific 3D scene information acquisition unit 80 performs learning by, for example, cutting out the corresponding area from the learning images 214a, 214b, 214c.
  • the narrower the range represented by the 3D scene information, the sooner the learning is completed.
  • the time until learning is completed varies depending on the balance. Qualitatively, the higher the information density and the wider the range represented, the slower the learning is completed, so the balance between information density and range is optimized to obtain an appropriate delay time.
  • the neural networks 216a, 216b, 216c, ... are each updated in accordance with the changes in the scene.
  • FIG. 6 is a diagram for explaining the manner in which 3D scene information of different ranges is acquired. Assume that a part of the scene of the display world 210 shown in FIG. 5 is displayed as the display image 222.
  • the type-specific 3D scene information acquisition unit 80, for example, separately acquires 3D scene information representing a range 226 in the display world 210 that corresponds to an important area 224 on the display image 222, and 3D scene information representing the entire scene including the outside of that range.
  • the important area 224 here is, for example, an area within a predetermined range from the user's gaze point, an area within a predetermined range from the center of the display image, an area showing the battle situation or acquired items, or an area where main objects such as enemy characters and user characters exist; the rules for selecting the area are set in advance according to the nature of the content.
  • the type-specific 3D scene information acquisition unit 80 may directly identify the main objects themselves that exist in the display world 210, or a range of a predetermined size that includes the objects, and acquire 3D scene information as the learning target.
  • a well-known gaze point detector is provided in the client terminal 10.
  • the input information acquisition unit 70 of the content server 20 then acquires gaze point information from the client terminal 10 at a predetermined rate and determines the learning range individually.
  • the type-specific 3D scene information acquisition unit 80 may separately learn the range of the display world corresponding to the display image 222 being displayed on the client terminal 10 and a wider range including the outside of that range. In any case, the type-specific 3D scene information acquisition unit 80 performs machine learning by extracting corresponding partial areas from the multiple learning images generated by the learning viewpoint generation unit 72, and also performs machine learning of a wider range by using the entire learning images, for example.
  • the number of variations in ranges represented by the 3D scene information is not limited to two, and may be three or more.
  • the inclusion relationship of the ranges is not limited, and 3D scene information may be obtained for each of multiple independent ranges. In any case, 3D scene information is obtained for the number of set ranges.
  • the narrower the range, the shorter the learning time and drawing time, making it possible to display with low latency.
  • By utilizing this characteristic, even if the information density of the 3D scene information corresponding to the important area 224 is increased to a certain extent, drawing can be completed in the same amount of time as drawing a wide-area image from low-information-density 3D scene information, provided the range is narrow. As a result, the important area 224 can be displayed with high definition and low latency.
  • Figure 7 shows a schematic diagram of data transitions during transmission and reception of 3D scene information.
  • the diagram shows the passage of time from top to bottom, with (a) to (c) showing data state transitions in the content server 20 and (c) to (e) showing data state transitions in the client terminal 10.
  • This diagram also shows 3D scene information representing one scene.
  • the type-specific 3D scene information acquisition unit 80 of the content server 20 acquires multiple neural networks 230a, 230b, ... each corresponding to multiple types of 3D scene information, based on a newly generated learning image, as shown in (a).
  • the 3D scene information transmission unit 86 of the content server 20 divides each of the neural networks 230a, 230b, ... as shown in (b). That is, from the neural network 230a it generates a plurality of neural networks 232a and 232b, in each of which some mutually different nodes have been removed. Likewise, from the neural network 230b it generates a plurality of neural networks 234a and 234b in which some mutually different nodes have been removed.
  • In the divided neural networks 232a, 232b, 234a, and 234b, the nodes removed from the original neural networks 230a and 230b are shown with dotted lines. If the nodes removed in one neural network 232a obtained by dividing the neural network 230a are left in the other neural network 232b, then combining the two completely restores the original neural network 230a.
  • the number of divisions of the neural network is not limited to two.
  • the division unit 88 of the 3D scene information transmission unit 86 preferably randomly determines nodes to be excluded using a method similar to dropout.
  • the 3D scene information transmission unit 86 packetizes the divided neural networks 232a, 232b, 234a, and 234b and transmits them to the client terminal 10.
  • the 3D scene information transmission unit 86 randomly rearranges the transmission order of packets from multiple neural networks, as shown in (c).
  • the transmission order is shown as being arranged horizontally, with neural networks 232b, 234b, 232a, ....
  • the rearrangement is performed between neural networks that have been updated in parallel within a certain allowable time. This prevents unnecessary delays in the rendering process due to the rearrangement.
  • When the 3D scene information acquisition unit 52 of the client terminal 10 acquires the packets in sequence, it reconstructs the neural networks before division by returning the extracted neural networks to their original order, as shown in (d). For this purpose, the content server 20 attaches metadata to each of the transmitted neural networks 232b, 234b, 232a, ... indicating from which of the original neural networks 230a, 230b, ... it was divided.
  • In the illustrated example, the 3D scene information acquisition unit 52 has acquired both neural networks 232a and 232b, and is therefore able to completely restore the original neural network 230a.
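The packetization, random reordering within an update window, and metadata-based reconstruction described above can be sketched as follows; the packet format and the `net`/`part` identifiers are hypothetical stand-ins for the metadata the content server attaches:

```python
import random

def packetize(networks):
    """Tag each divided-network part with metadata identifying its origin."""
    packets = []
    for net_id, parts in networks.items():
        for part_id, payload in enumerate(parts):
            packets.append({"net": net_id, "part": part_id, "data": payload})
    return packets

def reorder_within_window(packets, rng):
    """Randomly interleave packets produced in the same update window, so
    a burst of consecutive losses does not wipe out one whole network."""
    shuffled = packets[:]
    rng.shuffle(shuffled)
    return shuffled

def reconstruct(packets):
    """Receiver side: regroup parts by origin network and restore order."""
    nets = {}
    for pkt in packets:
        nets.setdefault(pkt["net"], {})[pkt["part"]] = pkt["data"]
    return {nid: [parts[i] for i in sorted(parts)] for nid, parts in nets.items()}

networks = {"230a": ["a0", "a1"], "230b": ["b0", "b1"]}
stream = reorder_within_window(packetize(networks), random.Random(7))
received = reconstruct(stream)
```

Because reconstruction keys on the attached metadata rather than arrival order, the receiver's result is independent of how the packets were interleaved.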
  • the image generation unit 56 of the client terminal 10 uses the acquired neural networks 232a, 232b, and 234b to draw the display image 238, as shown in (e).
  • the image generation unit 56 draws an image of the scene seen from the latest display viewpoint by volume rendering using the neural networks 232a, 232b, and 234b.
  • the image generation unit 56 updates at least a portion of the display image 238 using the latest 3D scene information. This makes it possible to realize a display in which the image resolution gradually increases and important images move with particularly low latency.
  • Figure 8 is a diagram for explaining the temporal relationship between machine learning in the content server 20 and image drawing in the client terminal 10.
  • the horizontal direction of the figure is the time axis, with the learning time in the content server 20 and the drawing time of each frame of the displayed image in the client terminal 10 each shown as a rectangle.
  • the number attached to each drawing time rectangle indicates the frame order.
  • the type-specific 3D scene information acquisition unit 80 of the content server 20 learns the first, second, ... nth types of 3D scene information individually.
  • the ordinal numbers correspond to the information density, with the first type being the lowest information density and the nth type being the highest information density.
  • When the content server 20 has completed learning the first type of 3D scene information, it transmits it to the client terminal 10, as shown by arrow A1.
  • the client terminal 10 uses the first type of 3D scene information to draw images of frames numbered 0, 1, and 2 at the lowest resolution from time t1. Meanwhile, when the content server 20 has completed learning the second type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow A2. The client terminal 10 begins drawing frames numbered 3 immediately after time t2 when the second type of 3D scene information is acquired, and draws images at a resolution corresponding to the second type of information density. By repeating the same process, the frame resolution gradually increases.
  • Once the content server 20 has completed learning the nth type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow An.
  • the client terminal 10 begins drawing the image at the highest resolution corresponding to the nth type of information density, starting with frame number m+2, which begins drawing immediately after time tn when the 3D scene information is acquired.
  • multiple types of 3D scene information representing the same scene are learned, and the 3D scene information that is completed learning the earliest is immediately transmitted to the client terminal 10. This allows the frame drawing start time to be significantly earlier than when only 3D scene information with a high information density, such as the nth type, is transmitted.
  • drawing can start with a delay of only the time Ta for data transmission and the time Tb required for learning the first type of 3D scene information.
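The timing advantage can be illustrated with assumed learning and transmission times corresponding to Tb and Ta above (all values are purely illustrative):

```python
# Assumed times in milliseconds; type 1 has the lowest information density.
learning_time_ms = {1: 10, 2: 25, 3: 60}
transmit_time_ms = 2  # time Ta for data transmission

# Progressive transmission: drawing starts once the first (fastest-learned)
# type is available, i.e. after time Tb plus transmission time Ta.
progressive_start = min(learning_time_ms.values()) + transmit_time_ms

# Monolithic transmission: drawing waits for the highest-density type.
monolithic_start = max(learning_time_ms.values()) + transmit_time_ms
```

Under these assumptions drawing begins at 12 ms instead of 62 ms; the later, denser types then refine the already-displayed frames rather than gating them.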
  • until 3D scene information for the current scene arrives, drawing of a frame may be performed using the 3D scene information of the previous scene transmitted immediately before.
  • the image generating unit 56 performs volume rendering based on the most recent display viewpoint acquired immediately before. This not only increases the resolution from frame numbers 0 to m+2, but also makes it possible to display images that reflect the movement of the viewpoint with low delay.
  • the lower the information density of the first type or other 3D scene information the fewer the number of samples in volume rendering and the higher the drawing speed, making it possible to display with less delay.
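The relationship between sample count and drawing speed can be seen in a minimal NeRF-style volume rendering quadrature. The density and color functions below stand in for queries to the 3D scene information; all values are illustrative:

```python
import numpy as np

def render_ray(density_fn, color_fn, t_near, t_far, n_samples):
    """Alpha-composite one ray with n_samples quadrature points.

    Fewer samples mean fewer queries to the scene representation, so
    low-information-density 3D scene information can be drawn faster."""
    ts = np.linspace(t_near, t_far, n_samples)
    dt = ts[1] - ts[0]
    sigmas = np.array([density_fn(t) for t in ts])
    colors = np.array([color_fn(t) for t in ts])
    alphas = 1.0 - np.exp(-sigmas * dt)                      # per-sample opacity
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = alphas * trans                                  # compositing weights
    return (weights[:, None] * colors).sum(axis=0)

# A uniform gray medium: 8 samples approximate the 64-sample result closely,
# at one eighth of the query cost.
gray = lambda t: np.array([0.5, 0.5, 0.5])
fine = render_ray(lambda t: 0.5, gray, 0.0, 4.0, 64)
coarse = render_ray(lambda t: 0.5, gray, 0.0, 4.0, 8)
```

The sketch shows why a coarse pass is cheap yet visually plausible: the compositing weights sum toward the same total regardless of how finely the ray is sampled.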
  • The image generation unit 56 can be said to have the function of correcting an image that has already been generated so that it matches the latest display viewpoint, regardless of whether it has already been displayed.
  • Figure 9 is a diagram for explaining the image correction process by image generation unit 56 of client terminal 10.
  • Image 250a is a frame or part of a display image, and shows image 252a of a cylindrical object in the foreground. If the position of the cylindrical object changes relatively due to movement of the viewpoint, and object image 252b shifts in image 250b after the lapse of time Δt, it becomes necessary to newly draw background area 254 that was hidden in image 250a.
  • images 250a and 250c are images that the client terminal 10 has drawn using the latest 3D scene information at that time.
  • When the content server 20 transmits the display image together with the 3D scene information, it is also possible for the client terminal 10 to correct image 250a transmitted from the content server 20 using the 3D scene information.
  • the client terminal 10 generates an image 250c that has been corrected to reflect the movement of the viewpoint that occurs during the time difference Δt between when the image 250a is generated on the content server 20 side and when it is displayed on the client terminal 10.
  • the image generating unit 56 of the client terminal 10 redraws the area 254 that is not shown in the image 250a generated by the content server 20 and the image 252b whose color has changed, using the latest 3D scene information.
  • Since the high-quality image 250a generated by the content server 20 is to be combined with the area newly drawn by the client terminal 10, it is desirable for the image generation unit 56 to draw the necessary area using 3D scene information with high information density. Note that the image generation unit 56 can omit new drawing processing using 3D scene information in a situation where there is little sense of incongruity simply by moving or deforming the image in the image 250a transmitted from the content server 20. For example, for distant objects where the amount of image shift or color change is small in response to changes in the display viewpoint, the image generation unit 56 may perform correction by directly processing the image 250a.
  • the image generation unit 56 stores in advance in an internal memory or the like rules for determining areas that need to be redrawn using 3D scene information. For example, when the speed of the display viewpoint exceeds a threshold value, or is predicted to exceed it, the image generation unit 56 may use 3D scene information to redraw objects whose distance from the viewpoint is less than or equal to the threshold value, and their surrounding areas. In this case, the content server 20 also transmits geometry information of the display world to the client terminal 10. This allows the image generation unit 56 to obtain changes in the distance between the display viewpoint and the object.
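The redraw-decision rule described above can be sketched as follows; the threshold values and parameter names are assumptions for illustration, not values from the patent:

```python
def needs_redraw(viewpoint_speed, object_distance,
                 speed_threshold=1.0, distance_threshold=5.0):
    """Redraw an object and its surroundings from 3D scene information when
    the display viewpoint moves fast and the object is near the viewpoint;
    otherwise a 2D shift or deformation of the received image suffices."""
    return viewpoint_speed > speed_threshold and object_distance <= distance_threshold

# Fast viewpoint motion near an object forces a full redraw; distant objects
# or slow motion can be handled by warping the existing image.
```

The distance test is what requires the geometry information the content server additionally transmits, since the client must relate each object's position to the moving display viewpoint.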
  • the content server 20 generates learning images that represent the display world of the content, and acquires multiple types of 3D scene information for each scene.
  • the content server 20 sequentially transmits the 3D scene information for which learning has been completed to the client terminal 10, and the client terminal 10 uses it to draw frames of the display image corresponding to the latest display viewpoint.
  • By preparing various types of 3D scene information that differ in, for example, information density and the width of the range to be represented, the learning time can be varied, and the delay time until display can also be controlled.
  • the level of detail can be increased in some areas, making it possible to display high-quality images with low latency that take into account the importance of the image, etc.
  • the content server 20 also divides the neural network that constitutes one piece of 3D scene information into multiple neural networks and transmits them to the client terminal 10. This makes it possible to improve robustness against packet loss while suppressing increases in data size. As a result, in image processing involving distribution from the content server 20, the impact of distribution can be reduced, improving the quality of the user experience.
  • the present invention can be used in various information processing devices such as content servers, game devices, head-mounted displays, display devices, mobile terminals, and personal computers, as well as image display systems that include any of these.
  • [Item 1] A content server comprising circuitry, the circuitry configured to: generate learning images that represent scenes of a three-dimensional display world, in which the situation changes in response to user operations, from multiple viewpoints; acquire a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data; and transmit the data of the plurality of types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that draws a display image using the 3D scene information.
  • [Item 2] The content server according to item 1, wherein the circuitry acquires a plurality of pieces of 3D scene information, each having a different spatial information density.
  • The content server according to item 1, wherein the circuitry divides a neural network constituting one piece of the 3D scene information into multiple neural networks, from each of which some mutually different nodes are excluded, and transmits the multiple neural networks to the client terminal.
  • the circuitry packetizes the neural network after division and randomly changes the transmission order of the packets.
  • the circuitry obtains a plurality of pieces of 3D scene information representing different areas in the display world.
  • the circuitry obtains the 3D scene information representing a range of the display world that corresponds to a defined area in an image displayed on the client terminal.
  • [Item 9] A client terminal comprising circuitry, the circuitry configured to: acquire information on a user operation and information on a display viewpoint for a three-dimensional display world; acquire from a server data of a plurality of types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world, whose situation changes in response to the user operation; and render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.
  • [Item 10] The client terminal according to item 9, wherein the circuitry obtains a plurality of neural networks, obtained by dividing a neural network that constitutes one piece of the 3D scene information such that some mutually different nodes are excluded from each, and reconstructs the neural network before division.
  • An image display system comprising: a client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation; and a content server that transmits data used to generate the display image, the client terminal and the content server each having circuitry, wherein:
  • the circuitry of the content server generates images representing each scene of the display world as seen from a plurality of viewpoints as learning images, obtains a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data, and transmits the data of the plurality of types of 3D scene information to the client terminal in the order in which machine learning for each scene is completed; and
  • the circuitry of the client terminal acquires information on the user operation and information on a display viewpoint for the display world, acquires the data of the plurality of types of 3D scene information from the content server, and renders at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.
  • A display data transmission method comprising: generating learning images that represent scenes of a three-dimensional display world, in which the situation changes in response to user operations, from multiple viewpoints; obtaining a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data; and transmitting the data of the plurality of types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that draws a display image using the 3D scene information.
  • A function of acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world; a function of acquiring from a server data of a plurality of types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world, whose situation changes in response to the user operation; and a function of rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on an updated display viewpoint.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Processing Or Creating Images (AREA)

Abstract

This content server 20 acquires neural networks 230a, 230b, ... representing a plurality of types of 3D scene information by generating learning images representing each scene in a display world and performing machine learning. The content server 20 divides the neural networks 230a, 230b, ... to generate a plurality of neural networks 232a, 232b, etc., randomly switches the order of packets, and transmits the packets to a client terminal 10. The client terminal 10 draws a display image 238 corresponding to the most recent display viewpoint by, for example, returning the divided neural networks 232a, 232b, etc. to the original neural network 230a.

Description

Content server, client terminal, image display system, display data transmission method, and display image generation method

 This invention relates to a content server, a client terminal, an image display system, a display data transmission method, and a display image generation method that display images of a three-dimensional display world.

 With the recent expansion of communication networks and developments in image processing technology, it has become possible to enjoy a wide variety of electronic content regardless of the viewing environment. For example, in the field of electronic games, a system has become widespread in which a server collects operation information entered into each client terminal and distributes game images that reflect this information as needed, allowing multiple players to participate in the same game regardless of location.

 Not limited to electronic games, electronic content in which video images generated in real time in response to user operations are distributed from a server can utilize the server's abundant processing environment, making it easier to display high-quality images while minimizing the impact of the client terminal's processing performance. On the other hand, because the transmission of operation information from the client terminal and the data transmission from the server that receives it always intervene, the display may fail to keep up with viewpoint operations, or images may be lost due to packet loss, resulting in a deterioration in the quality of the user experience.

 The present invention was made in consideration of these problems, and its purpose is to provide a technology that reduces the impact of distribution on the quality of the user experience when processing images of content that is distributed from a server.

 In order to solve the above problem, one aspect of the present invention relates to a content server. This content server is characterized by having a learning image generation unit that generates learning images that represent the state of each scene in a three-dimensional display world, where the situation changes in response to user operations, as viewed from multiple viewpoints; a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and a 3D scene information transmission unit that transmits data on the multiple types of 3D scene information, in the order in which machine learning for each scene is completed, to a client terminal that uses the 3D scene information to draw a display image.

 本発明の別の態様はクライアント端末に関する。このクライアント端末は、ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する入力情報取得部と、ユーザ操作に応じて状況が変化する表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する3Dシーン情報取得部と、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、を備えたことを特徴とする。 Another aspect of the present invention relates to a client terminal. This client terminal is characterized by having an input information acquisition unit that acquires information on user operations and information on a display viewpoint for a three-dimensional display world, a 3D scene information acquisition unit that acquires from a server data on multiple types of 3D scene information that represent three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to user operations, and an image generation unit that draws at least a portion of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 本発明のさらに別の態様は画像表示システムに関する。この画像表示システムは、ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、コンテンツサーバは、表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、クライアント端末に、複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、を備え、クライアント端末は、ユーザ操作の情報と、表示世界に対する表示用視点の情報とを取得する入力情報取得部と、複数種類の3Dシーン情報のデータを、コンテンツサーバから取得する3Dシーン情報取得部と、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、を備えたことを特徴とする。 Another aspect of the present invention relates to an image display system. This image display system includes a client terminal that displays an image of a three-dimensional display world in which the situation changes according to user operations, and a content server that transmits data used to generate the display image. The content server includes a learning image generation unit that generates images showing how each scene in the display world looks from multiple viewpoints as learning images, a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information that represent the three-dimensional information of each scene by machine learning using the learning images as teacher data, and a 3D scene information transmission unit that transmits the multiple types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed. The client terminal includes an input information acquisition unit that acquires information on user operations and information on a display viewpoint for the display world, a 3D scene information acquisition unit that acquires data on the multiple types of 3D scene information from the content server, and an image generation unit that draws at least a part of the frame of the display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 本発明のさらに別の態様は表示用データ送信方法に関する。この表示用データ送信方法は、ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成するステップと、学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得するステップと、3Dシーン情報を用いて表示画像を描画するクライアント端末に、複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信するステップと、を含むことを特徴とする。 Another aspect of the present invention relates to a display data transmission method. This display data transmission method is characterized by including the steps of: generating, as learning images, images that represent how each scene in a three-dimensional display world, in which the situation changes in response to user operations, is viewed from multiple viewpoints; acquiring multiple types of 3D scene information that represent the three-dimensional information of each scene through machine learning using the learning images as teacher data; and transmitting data on the multiple types of 3D scene information to a client terminal that uses the 3D scene information to draw a display image, in the order in which machine learning for each scene is completed.

 本発明のさらに別の態様は表示画像生成方法に関する。この表示画像生成方法は、ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得するステップと、ユーザ操作に応じて状況が変化する表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得するステップと、最新の表示用視点に基づき、直近に取得された3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画するステップと、を含むことを特徴とする。 Another aspect of the present invention relates to a display image generating method, which is characterized by including the steps of: acquiring information on user operations and information on a display viewpoint for a three-dimensional display world; acquiring from a server data on multiple types of 3D scene information representing three-dimensional information acquired by machine learning for each scene in the display world whose situation changes according to the user operations; and drawing at least a part of a frame of a display image using the most recently acquired 3D scene information based on the latest display viewpoint.

 なお、以上の構成要素の任意の組合せ、本発明の表現を方法、装置、システム、コンピュータプログラム、データ構造、記録媒体などの間で変換したものもまた、本発明の態様として有効である。 In addition, any combination of the above components, and any conversion of the present invention into a method, device, system, computer program, data structure, recording medium, etc., are also valid aspects of the present invention.

 本発明によれば、サーバからの配信を伴うコンテンツの画像処理において、配信によるユーザ体験の質への影響を軽減できる。 According to the present invention, when processing images of content that is distributed from a server, the impact of distribution on the quality of the user experience can be reduced.

本実施の形態を適用できる画像表示システムの構成例を示す図である。 FIG. 1 is a diagram showing a configuration example of an image display system to which the present embodiment can be applied.
本実施の形態のクライアント端末の内部回路構成を示す図である。 FIG. 2 is a diagram illustrating an internal circuit configuration of a client terminal according to the present embodiment.
本実施の形態の画像処理の基本的な流れを、従来技術と比較して示す図である。 FIG. 3 is a diagram showing the basic flow of image processing according to the present embodiment in comparison with the prior art.
本実施の形態におけるクライアント端末およびコンテンツサーバの機能ブロックの構成を示す図である。 FIG. 4 is a diagram showing the configuration of functional blocks of a client terminal and a content server according to the present embodiment.
本実施の形態においてコンテンツサーバが3Dシーン情報を取得する手順を模式的に示す図である。 FIG. 5 is a diagram schematically showing a procedure by which the content server acquires 3D scene information in the present embodiment.
本実施の形態において、異なる範囲の3Dシーン情報を取得する態様を説明するための図である。 FIG. 6 is a diagram for explaining how 3D scene information of different ranges is acquired in the present embodiment.
本実施の形態において、3Dシーン情報の送受信におけるデータの変遷を模式的に示す図である。 FIG. 7 is a diagram schematically showing data transitions during transmission and reception of 3D scene information in the present embodiment.
本実施の形態における、コンテンツサーバにおける機械学習と、クライアント端末における画像描画の時間的関係を説明するための図である。 FIG. 8 is a diagram for explaining the temporal relationship between machine learning on the content server and image drawing on the client terminal in the present embodiment.
本実施の形態における、クライアント端末の画像生成部による画像補正処理を説明するための図である。 FIG. 9 is a diagram for explaining image correction processing performed by the image generation unit of the client terminal in the present embodiment.

 図1は本実施の形態を適用できる画像表示システムの構成例を示す。画像表示システム1は、ユーザ操作に応じて画像を表示させるクライアント端末10a、10b、10cおよび、表示に用いるデータを提供するコンテンツサーバ20を含む。クライアント端末10a、10b、10cにはそれぞれ、ユーザ操作のための入力装置14a、14b、14cと、画像を表示する表示装置16a、16b、16cが接続される。クライアント端末10a、10b、10cとコンテンツサーバ20は、WAN(Wide Area Network)やLAN(Local Area Network)などのネットワーク8を介して通信を確立できる。 FIG. 1 shows an example of the configuration of an image display system to which this embodiment can be applied. The image display system 1 includes client terminals 10a, 10b, 10c that display images in response to user operations, and a content server 20 that provides data used for display. Input devices 14a, 14b, 14c for user operations and display devices 16a, 16b, 16c that display images are connected to the client terminals 10a, 10b, 10c, respectively. The client terminals 10a, 10b, 10c and the content server 20 can establish communication via a network 8 such as a WAN (Wide Area Network) or a LAN (Local Area Network).

 クライアント端末10a、10b、10cと、表示装置16a、16b、16cおよび入力装置14a、14b、14cはそれぞれ、有線または無線のどちらで接続されてもよい。あるいはそれらの装置の2つ以上が一体的に形成されていてもよい。例えば図においてクライアント端末10bは、表示装置16bであるヘッドマウントディスプレイに接続している。ヘッドマウントディスプレイは、それを頭部に装着したユーザの動きによって表示画像の視野を変更できるため、入力装置14bとしても機能する。 The client terminals 10a, 10b, 10c may be connected to the display devices 16a, 16b, 16c and the input devices 14a, 14b, 14c either wired or wirelessly. Alternatively, two or more of these devices may be formed integrally. For example, in the figure, the client terminal 10b is connected to a head-mounted display, which is the display device 16b. The head-mounted display can change the field of view of the displayed image according to the movement of the user wearing it on the head, so it also functions as the input device 14b.

 またクライアント端末10cは携帯端末であり、表示装置16cと、その画面を覆うタッチパッドである入力装置14cと一体的に構成されている。このように、図示する装置の外観形状や接続形態は限定されない。ネットワーク8に接続するクライアント端末10a、10b、10cやコンテンツサーバ20の数も限定されない。以後、クライアント端末10a、10b、10cをクライアント端末10、入力装置14a、14b、14cを入力装置14、表示装置16a、16b、16cを表示装置16と総称する。 The client terminal 10c is a portable terminal that is integrated with the display device 16c and the input device 14c, which is a touchpad that covers the screen of the display device 16c. In this way, there are no limitations on the external shape or connection form of the illustrated devices. There are also no limitations on the number of client terminals 10a, 10b, 10c and content servers 20 that are connected to the network 8. Hereinafter, the client terminals 10a, 10b, 10c will be collectively referred to as client terminals 10, the input devices 14a, 14b, 14c as input device 14, and the display devices 16a, 16b, 16c as display device 16.

 入力装置14は、コントローラ、キーボード、マウス、タッチパッド、ジョイスティックなど一般的な入力装置や、ヘッドマウントディスプレイが備えるモーションセンサ、カメラなどの各種センサのいずれか、または組み合わせでよく、クライアント端末10へユーザ操作の内容を供給する。表示装置16は、液晶ディスプレイ、プラズマディスプレイ、有機ELディスプレイ、ウェアラブルディスプレイ、プロジェクタなど一般的なディスプレイでよく、クライアント端末10から出力される画像を表示する。 The input device 14 may be any one or a combination of general input devices such as a controller, keyboard, mouse, touchpad, joystick, or various sensors such as a motion sensor or camera equipped in a head mounted display, and supplies the contents of user operations to the client terminal 10. The display device 16 may be any general display such as a liquid crystal display, plasma display, organic EL display, wearable display, or projector, and displays images output from the client terminal 10.

 コンテンツサーバ20は、画像表示を伴うコンテンツのデータをクライアント端末10に提供する。当該コンテンツの種類は特に限定されず、電子ゲーム、観賞用画像、ウェブページ、アバターによるビデオチャットなどのいずれでもよい。本実施の形態においてコンテンツサーバ20は、入力装置14に対するユーザ操作の情報を逐次、クライアント端末10から取得し、表示対象の世界に反映させたうえ、それを表す画像がクライアント端末10側で表示されるように、必要なデータを送信する。 The content server 20 provides data of content accompanied by image display to the client terminal 10. The type of content is not particularly limited, and may be any of electronic games, decorative images, web pages, video chat using avatars, etc. In this embodiment, the content server 20 sequentially obtains information on user operations on the input device 14 from the client terminal 10, reflects this information in the world to be displayed, and transmits the necessary data so that an image representing this is displayed on the client terminal 10 side.

 図2はクライアント端末10の内部回路構成を示している。クライアント端末10は、CPU(Central Processing Unit)122、GPU(Graphics Processing Unit)124、メインメモリ126を含む。これらの各部は、バス130を介して相互に接続されている。バス130にはさらに入出力インターフェース128が接続されている。入出力インターフェース128には、USBなどの周辺機器インターフェースや、有線又は無線LANのネットワークインターフェースからなる通信部132、ハードディスクドライブや不揮発性メモリなどの記憶部134、表示装置16へデータを出力する出力部136、入力装置14からデータを入力する入力部138、磁気ディスク、光ディスクまたは半導体メモリなどのリムーバブル記録媒体を駆動する記録媒体駆動部140が接続される。 Figure 2 shows the internal circuit configuration of the client terminal 10. The client terminal 10 includes a CPU (Central Processing Unit) 122, a GPU (Graphics Processing Unit) 124, and a main memory 126. These components are interconnected via a bus 130. An input/output interface 128 is also connected to the bus 130. To the input/output interface 128, there are connected a communication unit 132 consisting of a peripheral device interface such as a USB or a network interface for a wired or wireless LAN, a storage unit 134 such as a hard disk drive or non-volatile memory, an output unit 136 that outputs data to the display device 16, an input unit 138 that inputs data from the input device 14, and a recording medium drive unit 140 that drives a removable recording medium such as a magnetic disk, optical disk, or semiconductor memory.

 CPU122は、記憶部134に記憶されているオペレーティングシステムを実行することにより、クライアント端末10の全体を制御する。CPU122はまた、リムーバブル記録媒体から読み出されてメインメモリ126にロードされた、あるいは通信部132を介してダウンロードされた各種プログラムを実行する。GPU124は、CPU122からの描画命令に従って描画処理を行い、表示画像を図示しないフレームバッファに格納する。そしてフレームバッファに格納された表示画像をビデオ信号に変換して出力部136に出力する。メインメモリ126はRAM(Random Access Memory)により構成され、処理に必要なプログラムやデータを記憶する。コンテンツサーバ20も同様の内部回路構成を有してよい。 The CPU 122 executes an operating system stored in the storage unit 134 to control the entire client terminal 10. The CPU 122 also executes various programs that have been read from a removable recording medium and loaded into the main memory 126, or downloaded via the communication unit 132. The GPU 124 performs drawing processing in accordance with drawing commands from the CPU 122, and stores the display image in a frame buffer (not shown). The GPU 124 then converts the display image stored in the frame buffer into a video signal and outputs it to the output unit 136. The main memory 126 is composed of RAM (Random Access Memory), and stores programs and data necessary for processing. The content server 20 may also have a similar internal circuit configuration.

 図3は、本実施の形態の画像処理の基本的な流れを、従来技術と比較して示している。本実施の形態では、様々なオブジェクトが存在する3次元空間の世界を主たる表示対象とする。当該世界の状況は、プログラム等の規定やユーザ操作に応じて変化する。(a)に示す一般的な処理の場合、コンテンツサーバは、ユーザ操作の内容や、表示世界に対する視点の位置、視線の方向の情報を随時取得する。以後、表示対象の3次元空間全体を「表示世界」、表示視野内またはその近傍の表示世界の状態を「シーン」と呼ぶ。また、シーンに対する視点の位置および視線の方向を、単に「視点」と総称する場合がある。視点はユーザが、入力装置14を介して手動で操作してもよいし、ヘッドマウントディスプレイが備えるモーションセンサなどによって、ユーザ頭部の動きから導出してもよい。 Figure 3 shows the basic flow of image processing in this embodiment in comparison with the prior art. In this embodiment, the main display target is a three-dimensional world in which various objects exist. The state of the world changes according to the provisions of the program or the user's operation. In the case of the general processing shown in (a), the content server constantly acquires information on the content of the user's operation, the position of the viewpoint relative to the displayed world, and the direction of the line of sight. Hereinafter, the entire three-dimensional space to be displayed is called the "display world", and the state of the displayed world within or near the display field of view is called the "scene". The position of the viewpoint and the direction of the line of sight relative to the scene may also be collectively referred to simply as the "viewpoint". The viewpoint may be manually operated by the user via the input device 14, or may be derived from the movement of the user's head using a motion sensor provided in the head-mounted display.

 コンテンツサーバは、ユーザ操作に対応するようにシーンを変化させながら、視点情報に対応する視野で表示画像200を描画する。コンテンツサーバは例えば、レイトレーシングやラスタライズなど周知のコンピュータグラフィクス描画技術により、表示画像200を生成する。コンテンツサーバは生成した表示画像200をクライアント端末へ送信し、クライアント端末はそれを表示装置に表示させる。図示する処理を所定のフレームレートで繰り返すことにより、クライアント端末側では、ユーザ操作等に応じたシーンの変化を表す動画像が表示される。 The content server draws the display image 200 in a field of view corresponding to the viewpoint information while changing the scene in response to user operations. The content server generates the display image 200 using well-known computer graphics drawing techniques such as ray tracing and rasterization. The content server transmits the generated display image 200 to the client terminal, which displays it on a display device. By repeating the illustrated process at a specified frame rate, a moving image showing the change in the scene in response to user operations, etc. is displayed on the client terminal side.

 (b)が示す本実施の形態においても、コンテンツサーバ20は、ユーザ操作の内容や、表示世界に対する視点の位置、視線の方向の情報を随時取得し、それに対応するようにシーンを変化させながら、画像を同様に描画する。一方、本実施の形態においてコンテンツサーバ20は、当該画像を学習用画像202とし、機械学習の教師データに用いる。コンテンツサーバ20は学習用画像202を収集して機械学習を行うことにより、シーンの3次元情報を表す3Dシーン情報204を生成する。 In this embodiment shown in (b), the content server 20 also acquires information on the content of user operations, the position of the viewpoint relative to the displayed world, and the direction of the gaze as needed, and similarly renders images while changing the scene accordingly. Meanwhile, in this embodiment, the content server 20 uses the images as training images 202 and as training data for machine learning. The content server 20 collects the training images 202 and performs machine learning to generate 3D scene information 204 that represents three-dimensional information about the scene.

 機械学習により3次元空間の情報を獲得する手法としてNeRF(Neural Radiance Fields)がある(例えば、Ben Mildenhall、外5名、「NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis」、Communications of the ACM、2022年1月、第65巻、第1号、p.99-106参照)。本実施の形態においてNeRFを導入する場合、まず、学習用画像202を生成する際に定めた、それぞれの視点情報、すなわち仮想的な視点の位置と視線の方向を入力とし、対応する学習用画像202を教師データとして、多層パーセプトロン(MLP:Multilayer perceptron)を用いた回帰により、シーンの3次元情報を表すデータを得る。 NeRF (Neural Radiance Fields) is a method for acquiring information about three-dimensional space through machine learning (see, for example, Ben Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," Communications of the ACM, January 2022, Vol. 65, No. 1, pp. 99-106). When NeRF is introduced in this embodiment, first, the respective viewpoint information determined when generating the training images 202, i.e., the virtual viewpoint position and line of sight direction, are input, and the corresponding training images 202 are used as teacher data to obtain data representing the three-dimensional information of the scene by regression using a multilayer perceptron (MLP).

 このデータは、3次元空間における位置座標(x,y,z)と方向ベクトルd(θ,φ)からなる5次元のパラメータを入力とし、体積密度σと3原色の色情報c(RGB)を出力とするニューラルネットワークである。本実施の形態では当該ニューラルネットワークのデータを「3Dシーン情報」と呼んでいる。なおNeRFについては様々な改良手法が提案されており、本実施の形態においても具体的な手法は特に限定されない。また機械学習の手法をNeRFに限定する主旨ではない。 This data is a neural network that takes five-dimensional parameters consisting of position coordinates (x, y, z) and directional vector d (θ, φ) in three-dimensional space as input, and outputs volume density σ and color information c (RGB) of the three primary colors. In this embodiment, the data of this neural network is called "3D scene information." Note that various improvement methods have been proposed for NeRF, and the specific method is not particularly limited in this embodiment. Furthermore, it is not intended to limit the machine learning method to NeRF.
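The 5-dimensional-input, 4-dimensional-output mapping described above can be sketched as a small fully connected network. The following NumPy sketch is only an illustration of the idea, not the implementation of this embodiment; the layer sizes, activations, and function names are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Randomly initialize weights and biases for a small fully connected network."""
    return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """Forward pass: ReLU hidden layers, linear output layer."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

# 5-D input (x, y, z, theta, phi) -> 4-D output (sigma, R, G, B)
params = init_mlp([5, 64, 64, 4])

def radiance_field(pos, view_dir):
    """Query volume density sigma and color c at a 3-D point seen from a direction."""
    out = mlp_forward(params, np.concatenate([pos, view_dir]))
    sigma = np.log1p(np.exp(out[0]))          # softplus keeps density non-negative
    color = 1.0 / (1.0 + np.exp(-out[1:]))    # sigmoid keeps RGB in [0, 1]
    return sigma, color

sigma, color = radiance_field(np.array([0.1, 0.2, 0.3]), np.array([0.0, 1.0]))
print(sigma >= 0.0, color.shape)
```

In an actual NeRF pipeline the network weights would be fitted by regression against the learning images; here they are random, so only the input/output structure is meaningful.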

 本実施の形態において学習用画像202が表す内容、ひいては3Dシーン情報204は時々刻々と変化し得る。図ではある一時刻、または一時刻と見なせる微小時間におけるシーンの3Dシーン情報204が生成される状況を表している。以後、一時刻、または一時刻と見なせる微小時間におけるシーンを「1つのシーン」と表現する場合がある。なお表示世界に動きがない場合は、時間によらず「1つのシーン」として扱うことができる。また、前のシーンに対し3Dシーン情報204が得られているとき、コンテンツサーバ20は次のシーンに対応するように3Dシーン情報204を更新する。以後、3Dシーン情報の生成および更新を、3Dシーン情報の「取得」と総称する場合がある。 In this embodiment, the content represented by the learning image 202, and therefore the 3D scene information 204, can change from moment to moment. The figure shows the situation in which 3D scene information 204 of a scene at a certain time, or at an infinitesimal time that can be considered as a certain time, is generated. Hereinafter, a scene at a certain time, or at an infinitesimal time that can be considered as a certain time, may be expressed as "one scene". Note that if there is no movement in the displayed world, it can be treated as "one scene" regardless of time. Also, when 3D scene information 204 has been obtained for the previous scene, the content server 20 updates the 3D scene information 204 to correspond to the next scene. Hereinafter, the generation and updating of 3D scene information may be collectively referred to as "obtaining" 3D scene information.

 精度のよい3Dシーン情報204を取得するには、コンテンツサーバ20は1つのシーンの学習用画像202を、なるべく多くの視点から収集することが望ましい。そのためコンテンツサーバ20は、例えば次のような方法で学習用画像202を収集してもよい。 In order to obtain accurate 3D scene information 204, it is desirable for the content server 20 to collect learning images 202 of one scene from as many viewpoints as possible. For this reason, the content server 20 may collect the learning images 202, for example, in the following manner.
(1)実際に表示される画像の視野を規定する視点の周囲に、学習に適した視点を自ら生成し、対応する画像を生成する (1) Generate viewpoints suitable for learning around the viewpoint that defines the field of view of the image that is actually displayed, and generate the corresponding images.
(2)同じシーンを見ている複数のユーザの端末において表示される画像を流用する (2) Reuse images displayed on the terminals of multiple users viewing the same scene.

 以後、クライアント端末10における実際の表示視野を規定する視点を「表示用視点」と呼び、学習用画像を生成する際に設定する「学習用視点」と区別する。コンテンツサーバ20は(1)と(2)のどちらか一方のみを実施してもよいし、双方を実施してもよい。例えば(2)によって足りない視点を、(1)によって補ってもよい。コンテンツサーバ20は、学習結果である3Dシーン情報204をクライアント端末10へ送信する。クライアント端末10は、送信された3Dシーン情報204を用いて表示画像206を生成する。 Hereinafter, the viewpoint that defines the actual display field of view in the client terminal 10 is called the "display viewpoint" to be distinguished from the "learning viewpoint" that is set when generating a learning image. The content server 20 may implement only one of (1) and (2), or may implement both. For example, a viewpoint that is missing due to (2) may be supplemented by (1). The content server 20 transmits 3D scene information 204, which is the result of the learning, to the client terminal 10. The client terminal 10 generates a display image 206 using the transmitted 3D scene information 204.

 3Dシーン情報204を用いることにより、クライアント端末10は比較的低い負荷で、シーンを任意の視点から見た様子を高品質に表すことができる。NeRFを適用する場合、クライアント端末10は、表示用視点からビュースクリーンの各画素を通る光線(レイ)rを発生させ、その方向に沿って色を積分していくボリュームレンダリングにより、表示画像の画素値C(r)を次のように求める。 By using the 3D scene information 204, the client terminal 10 can display the scene as it is viewed from any viewpoint with high quality, with a relatively low load. When NeRF is applied, the client terminal 10 generates a ray r that passes through each pixel on the view screen from the display viewpoint, and uses volume rendering to integrate the color along that direction to determine the pixel value C(r) of the display image as follows:

C(r) = ∫_{t_n}^{t_f} T(t) σ(r(t)) c(r(t), d) dt,  where r(t) = o + t d    (Equation 1)

 ここでt_n、t_fはそれぞれ、レイrの近位と遠位、T(t)はレイの方向における累積透過率であり、次のように表される。 Here, t_n and t_f are the near and far bounds of the ray r, respectively, and T(t) is the cumulative transmittance in the direction of the ray, expressed as follows.

T(t) = exp( −∫_{t_n}^{t} σ(r(s)) ds )    (Equation 2)

 コンテンツサーバ20は、シーンの動きに対し図示する処理を繰り返すことにより、3Dシーン情報204を所定のレートで更新していき、順次、クライアント端末10に送信する。クライアント端末10は、用いる3Dシーン情報を更新しながら表示画像206を生成することにより、コンテンツサーバ20が生成する学習用画像202と同等の変化を有する動画像を、任意の視点から表現できる。例えばクライアント端末10が、表示直前の視点からシーンを見た様子を表す表示画像206を、最新の3Dシーン情報204を用いて描画することにより、視点の変化に対し遅延の少ない画像を表示できる。 The content server 20 repeats the process shown in the figure for the movement of the scene, updating the 3D scene information 204 at a predetermined rate and sequentially transmitting it to the client terminal 10. The client terminal 10 generates a display image 206 while updating the 3D scene information it uses, thereby making it possible to represent moving images from any viewpoint that have the same changes as the learning images 202 generated by the content server 20. For example, the client terminal 10 can display an image with little delay in response to changes in viewpoint by drawing a display image 206 that shows the scene as seen from the viewpoint immediately before display using the latest 3D scene information 204.
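In practice, the per-ray color accumulation used in this kind of volume rendering is computed as a discrete quadrature over samples taken between the near and far bounds of the ray. The following NumPy sketch is an illustrative assumption, not part of the disclosed embodiment; `render_ray` is a hypothetical name.

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray.

    sigmas: (N,) volume densities at N samples between t_n and t_f
    colors: (N, 3) RGB colors at the samples
    deltas: (N,) spacing between consecutive samples
    Returns the accumulated pixel color C(r).
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)        # opacity of each segment
    trans = np.cumprod(1.0 - alphas)               # cumulative transmittance
    trans = np.concatenate([[1.0], trans[:-1]])    # shift: T_i depends on samples j < i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# Example: a fully opaque red sample hides the green sample behind it.
sigmas = np.array([1e9, 1e9])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
deltas = np.array([0.1, 0.1])
print(render_ray(sigmas, colors, deltas))  # ≈ [1, 0, 0]
```

The client would evaluate the neural network at each sample to obtain `sigmas` and `colors`, then run this accumulation once per pixel of the view screen.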

 図4は、本実施の形態におけるクライアント端末10およびコンテンツサーバ20の機能ブロックの構成を示している。図示する機能ブロックにおける構成要素の機能は、本明細書にて記載された機能を実現するように構成され又はプログラムされた、汎用プロセッサ、特定用途プロセッサ、集積回路、ASICs (Application Specific Integrated Circuits)、CPU (a Central Processing Unit)、従来型の回路、および/又はそれらの組合せを含む、回路(circuitry)又は処理回路(processing circuitry)において実現されてよい。プロセッサは、トランジスタやその他の回路を含む回路(circuitry)又は処理回路(processing circuitry)とみなされる。プロセッサは、メモリに格納されたプログラムを実行する、プログラムされたプロセッサ(programmed processor)であってもよい。 FIG. 4 shows the functional block configuration of the client terminal 10 and the content server 20 in this embodiment. The functions of the components in the illustrated functional blocks may be realized in circuitry or processing circuitry including general purpose processors, application specific processors, integrated circuits, ASICs (Application Specific Integrated Circuits), a CPU (a Central Processing Unit), conventional circuits, and/or combinations thereof, configured or programmed to realize the functions described herein. A processor is considered to be a circuitry or processing circuitry including transistors and other circuits. A processor may be a programmed processor that executes programs stored in memory.

 本明細書において、回路(circuitry)、ユニット、手段は、記載された機能を実現するようにプログラムされたハードウェア、又は実行するハードウェアである。当該ハードウェアは、本明細書に開示されているあらゆるハードウェア、又は、当該記載された機能を実現するようにプログラムされた、又は、実行するものとして知られているあらゆるハードウェアであってもよい。当該ハードウェアが回路(circuitry)のタイプであるとみなされるプロセッサである場合、当該回路(circuitry)、手段、又はユニットは、ハードウェアと、当該ハードウェア及び又はプロセッサを構成する為に用いられるソフトウェアとの組合せである。 In this specification, a circuit, unit, or means is hardware that is programmed to realize or executes a described function. The hardware may be any hardware disclosed in this specification or any hardware that is programmed to realize or known to execute the described function. If the hardware is a processor, which is considered to be a type of circuit, the circuit, means, or unit is a combination of hardware and software used to configure the hardware and/or processor.

 クライアント端末10は、ユーザ操作などの入力情報を取得する入力情報取得部50、コンテンツサーバ20から3Dシーン情報のデータを取得する3Dシーン情報取得部52、取得した3Dシーン情報のデータを格納する3Dシーン情報記憶部54、表示画像を生成する画像生成部56、および、表示画像のデータを出力する出力部58を備える。入力情報取得部50は、ユーザ操作の内容を入力装置14から随時取得する。ユーザ操作には、コンテンツの選択や起動、実施中のコンテンツに対するコマンド入力などが含まれる。 The client terminal 10 includes an input information acquisition unit 50 that acquires input information such as user operations, a 3D scene information acquisition unit 52 that acquires 3D scene information data from the content server 20, a 3D scene information storage unit 54 that stores the acquired 3D scene information data, an image generation unit 56 that generates a display image, and an output unit 58 that outputs the display image data. The input information acquisition unit 50 acquires the contents of user operations from the input device 14 at any time. User operations include the selection and activation of content, and command input for content currently being executed.

 入力情報取得部50はまた、表示世界に対する表示用視点の情報を随時、あるいは所定の時間間隔で、入力装置14やヘッドマウントディスプレイから取得する。ヘッドマウントディスプレイを装着したユーザの頭部の位置や姿勢を検出し、それに基づき視点の情報を取得する技術は周知であり、本実施の形態においてもそれを適用してよい。入力情報取得部50は、取得した情報をコンテンツサーバ20および画像生成部56に適宜供給する。3Dシーン情報取得部52は、継続的に更新される3Dシーン情報のデータを、コンテンツサーバ20から順次取得する。 The input information acquisition unit 50 also acquires information on the display viewpoint for the displayed world from the input device 14 or head-mounted display at any time or at a specified time interval. Technology for detecting the position and posture of the head of a user wearing a head-mounted display and acquiring viewpoint information based on this is well known, and this may also be applied in this embodiment. The input information acquisition unit 50 supplies the acquired information to the content server 20 and the image generation unit 56 as appropriate. The 3D scene information acquisition unit 52 sequentially acquires data of 3D scene information, which is continuously updated, from the content server 20.

 後述するように3Dシーン情報取得部52は、1つのシーンを表す複数種類の3Dシーン情報をコンテンツサーバ20から取得する。3Dシーン情報記憶部54は、3Dシーン情報取得部52が取得した複数種類の3Dシーン情報のデータを格納する。3Dシーン情報取得部52は、新たな3Dシーン情報を取得すると、3Dシーン情報記憶部54に格納された、同じ種類の3Dシーン情報のデータを更新する。 As described below, the 3D scene information acquisition unit 52 acquires multiple types of 3D scene information representing one scene from the content server 20. The 3D scene information storage unit 54 stores the data of the multiple types of 3D scene information acquired by the 3D scene information acquisition unit 52. When the 3D scene information acquisition unit 52 acquires new 3D scene information, it updates the data of the same type of 3D scene information stored in the 3D scene information storage unit 54.

 画像生成部56は、3Dシーン情報記憶部54に直近に格納された3Dシーン情報のデータを用いて、所定のフレームレートで表示画像を描画する。ここで画像生成部56は、最新の表示用視点を入力情報取得部50から取得し、上述したボリュームレンダリングなどの手法により対応する視野で画像を描画する。機械学習を用いて、シーンの変遷に対応した3Dシーン情報を準備できれば、クライアント端末10が表示画像を生成しても、通常のレイトレーシングなどの処理と比較し軽い負荷で、高品質な画像を描画できる。 The image generation unit 56 uses the 3D scene information data most recently stored in the 3D scene information storage unit 54 to draw a display image at a predetermined frame rate. Here, the image generation unit 56 acquires the latest display viewpoint from the input information acquisition unit 50, and draws an image in the corresponding field of view using a technique such as the volume rendering described above. If 3D scene information corresponding to the transition of the scene can be prepared using machine learning, then even if the client terminal 10 generates the display image, it is possible to draw a high-quality image with a lighter load than with normal processing such as ray tracing.

 画像生成部56は、複数種類の3Dシーン情報を用いて描画を行うことにより、1つのシーンを表す表示画像を時間的、空間的、あるいはその双方で変化させる。例えばコンテンツサーバ20が、情報の密度が低い順に3Dシーン情報のデータを送信する態様において、画像生成部56は、送信された3Dシーン情報を順次用いて表示画像を更新していく。これにより、1つのシーンを表す画像を低遅延で表示できるとともに、徐々に解像度が上がることにより視認上の画質を維持できる。出力部58は、画像生成部56が描画した画像を所定のレートで表示装置16に出力し表示させる。 The image generation unit 56 changes the display image representing one scene temporally, spatially, or both by drawing with multiple types of 3D scene information. For example, in a mode in which the content server 20 transmits 3D scene information data in ascending order of information density, the image generation unit 56 updates the display image using the transmitted 3D scene information in sequence. This allows an image representing one scene to be displayed with low latency, while the gradually increasing resolution maintains the perceived image quality. The output unit 58 outputs the image drawn by the image generation unit 56 to the display device 16 at a predetermined rate for display.
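The interplay between the 3D scene information storage unit 54 (keep the most recent data of each type, overwrite on arrival) and coarse-to-fine drawing can be illustrated with a small sketch. The class and method names below are hypothetical, chosen only for demonstration.

```python
from dataclasses import dataclass, field

@dataclass
class SceneInfoStore:
    """Keeps the most recently received data for each type of 3D scene information."""
    by_type: dict = field(default_factory=dict)

    def update(self, kind, data):
        # Newly arrived data overwrites older data of the same type.
        self.by_type[kind] = data

    def best_available(self, preference):
        """Return the most preferred type that has arrived so far, or None."""
        for kind in preference:
            if kind in self.by_type:
                return kind, self.by_type[kind]
        return None

store = SceneInfoStore()
store.update("low_density", "coarse network weights")
# A frame drawn now uses the low-density data with little latency;
# once the high-density data arrives, the next frame switches to it.
store.update("high_density", "fine network weights")
kind, _ = store.best_available(["high_density", "low_density"])
print(kind)
```

This captures only the selection logic; the actual drawing with the chosen scene information proceeds as in the volume rendering described earlier.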

 コンテンツサーバ20は、クライアント端末10から入力情報を取得する入力情報取得部70、学習用視点を生成する学習用視点生成部72、表示世界を制御する表示世界制御部74、オブジェクトの3次元モデルを記憶する3次元モデル記憶部76、学習用画像を生成する学習用画像生成部78、複数種類の3Dシーン情報のデータを取得する種類別3Dシーン情報取得部80、取得した3Dシーン情報のデータを格納する3Dシーン情報記憶部84、および、3Dシーン情報のデータをクライアント端末10へ送信する3Dシーン情報送信部86を備える。 The content server 20 includes an input information acquisition unit 70 that acquires input information from the client terminal 10, a learning viewpoint generation unit 72 that generates a learning viewpoint, a display world control unit 74 that controls the display world, a 3D model storage unit 76 that stores 3D models of objects, a learning image generation unit 78 that generates learning images, a type-specific 3D scene information acquisition unit 80 that acquires data on multiple types of 3D scene information, a 3D scene information storage unit 84 that stores the acquired data on the 3D scene information, and a 3D scene information transmission unit 86 that transmits the data on the 3D scene information to the client terminal 10.

 入力情報取得部70は、ユーザ操作の内容や視点の情報を、クライアント端末10から随時、あるいは所定の時間間隔で取得する。学習用視点生成部72は、学習用画像を生成するための学習用視点を複数生成する。学習用視点生成部72は、入力情報取得部70が取得した最新の表示用視点の周囲に、所定の規則で学習用視点を生成する。学習用視点生成部72は例えば、最新の表示用視点を中心とし所定半径の球の内部に均等に、所定数の学習用視点を配置する。ここで所定半径とは、視点の移動に想定される最高速度に、当該データを用いた表示がなされるまでの最長時間を乗算した値などとする。 The input information acquisition unit 70 acquires the contents of user operations and viewpoint information from the client terminal 10 at any time or at a specified time interval. The learning viewpoint generation unit 72 generates multiple learning viewpoints for generating learning images. The learning viewpoint generation unit 72 generates learning viewpoints around the latest display viewpoint acquired by the input information acquisition unit 70 according to a specified rule. For example, the learning viewpoint generation unit 72 places a specified number of learning viewpoints evenly inside a sphere of a specified radius centered on the latest display viewpoint. Here, the specified radius is a value obtained by multiplying the maximum speed expected for viewpoint movement by the longest time until display using the data is performed, for example.
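The placement of learning viewpoints inside a sphere around the latest display viewpoint, with radius equal to the maximum expected viewpoint speed multiplied by the worst-case time until the data is used for display, can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, and the random sampling only approximates the even placement described above.

```python
import numpy as np

rng = np.random.default_rng(7)

def learning_viewpoints(display_pos, max_speed, max_latency, count):
    """Sample candidate learning viewpoints inside a sphere centered
    on the latest display viewpoint.

    Radius = maximum expected viewpoint speed x longest time until
    the learned data is used for display.
    """
    radius = max_speed * max_latency
    # Uniform directions: normalize Gaussian samples.
    dirs = rng.normal(size=(count, 3))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    # Uniform density over the ball requires the cube-root radius scaling.
    radii = radius * rng.random(count) ** (1.0 / 3.0)
    return display_pos + dirs * radii[:, None]

center = np.array([0.0, 1.5, 0.0])
pts = learning_viewpoints(center, max_speed=2.0, max_latency=0.5, count=8)
dists = np.linalg.norm(pts - center, axis=1)
print((dists <= 1.0).all())  # all within the 2.0 * 0.5 = 1.0 radius
```

Biasing more viewpoints toward the expected direction of motion, as the text suggests, would amount to replacing the uniform distributions with weighted ones.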

 The learning viewpoint generation unit 72 is not limited to distributing the learning viewpoints evenly; it may place more learning viewpoints in the range into which the viewpoint is expected to move, depending on the state of the display world and so on. The learning viewpoint generation unit 72 may also set lines of sight evenly in a predetermined number of directions for the viewpoint at each position, or may set more lines of sight in the directions in which the line of sight is expected to move. The learning viewpoint generation unit 72 may also use the latest display viewpoint itself as a learning viewpoint.

 The display world control unit 74 controls the three-dimensional display world represented as content according to the contents of the user operations acquired by the input information acquisition unit 70. For example, if the content is an electronic game, the display world control unit 74 places necessary objects, such as user characters, in the virtual space in which the game takes place, and gives them movement according to user-entered commands and the rules of the program.

 The three-dimensional model storage unit 76 stores three-dimensional models of the objects that exist in the display world; the display world control unit 74 reads them out as appropriate and uses them to construct the display world. The learning image generation unit 78 generates, as learning images, images of the scene viewed from the multiple learning viewpoints generated by the learning viewpoint generation unit 72. The learning image generation unit 78 preferably generates the learning images using a technique capable of rendering high-quality images, such as ray tracing.

 The type-specific 3D scene information acquisition unit 80 uses the learning images generated by the learning image generation unit 78 to generate 3D scene information through machine learning as described above, and updates it to track changes in the scene. Here, the type-specific 3D scene information acquisition unit 80 acquires multiple types of 3D scene information representing a single scene. For example, it acquires multiple pieces of 3D scene information whose spatial densities of represented information differ from one another. Hereinafter, the spatial density of the information held by 3D scene information is called the "information density." Information density may also be thought of as the resolution of the information held by the 3D scene information, or as its spatial frequency.

 For example, according to the NeRF literature cited above, the vectors input to the neural network during training are mapped, by positional encoding, to vectors in a high-dimensional space containing high frequencies, which allows the high-frequency components of the output to be represented more accurately. The function γ used for the conversion is expressed as follows:

γ(p) = ( sin(2^0 πp), cos(2^0 πp), ..., sin(2^(L-1) πp), cos(2^(L-1) πp) )

 In the above expression, the larger the parameter L, the more detailed the 3D scene information that can be acquired, including higher frequency components. Exploiting this, the type-specific 3D scene information acquisition unit 80 sets multiple values of the parameter L and performs machine learning separately for each, thereby acquiring in parallel multiple pieces of 3D scene information with different information densities. However, the means of controlling the information density of 3D scene information is not limited to this. For example, instead of positional encoding, multiresolution hash encoding, which sets grids of multiple resolutions and expresses the input vector based on its positional relationship to their vertices, may be applied (see, for example, Thomas Muller et al., "Instant Neural Graphics Primitives with a Multiresolution Hash Encoding," ACM Transactions on Graphics, July 2022, Vol. 41, No. 4, Article 102, pp. 1-15).
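 As an illustration of how the parameter L controls the frequency content, the positional encoding above can be sketched as follows. This is a minimal sketch based on the cited NeRF formulation; the function name and the NumPy vectorization are assumptions for illustration.

```python
import numpy as np

def positional_encoding(p, L):
    """Map each input coordinate to 2*L sinusoids of increasing frequency.

    A larger L adds higher-frequency components, corresponding to 3D scene
    information with higher information density (and longer learning time).
    """
    p = np.atleast_1d(np.asarray(p, dtype=float))
    features = []
    for i in range(L):  # frequencies 2^0 * pi, 2^1 * pi, ..., 2^(L-1) * pi
        features.append(np.sin((2.0 ** i) * np.pi * p))
        features.append(np.cos((2.0 ** i) * np.pi * p))
    return np.concatenate(features)
```

For instance, with L = 10 a three-component position expands to a 60-dimensional vector; preparing several values of L yields encodings, and hence learned scene representations, of different information densities.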

 In this case, the type-specific 3D scene information acquisition unit 80 can acquire in parallel multiple pieces of 3D scene information with different information densities by setting multiple values of the level count L, which corresponds to the number of grid resolutions, and performing machine learning separately for each. In these cases, multiple values to be set as the parameter L are prepared in the internal memory of the type-specific 3D scene information acquisition unit 80. The type-specific 3D scene information acquisition unit 80 then sequentially uses the multiple learning images generated for one scene by the learning image generation unit 78 to acquire, in parallel, multiple pieces of 3D scene information with different information densities, storing them in the 3D scene information storage unit 84 or updating already-stored 3D scene information of the same type.

 In this case, even for the same scene, the learning speed differs depending on the information density; the lower the information density, the sooner learning is completed, and therefore the sooner the 3D scene information is updated. When performing machine learning with grids, the type-specific 3D scene information acquisition unit 80 may instead set only a single level count L for learning and, at the time of transmission to the client terminal 10 or the like, individually select grids of different levels (for example, the grids of levels 0 to 1, levels 0 to 2, ..., levels 0 to L-1) and read out their data. In this case too, lower-level grids, that is, grids with lower information density, complete learning sooner, and the effects described later are likewise obtained.

 The types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 are not limited to distinctions in information density. For example, the type-specific 3D scene information acquisition unit 80 may acquire multiple pieces of 3D scene information whose learning targets are different regions or objects in the display world. In this case, the smaller the range of the display world targeted for learning, the sooner learning is completed, and therefore the sooner the 3D scene information is updated.

 Thus, even for the same scene, the multiple types of 3D scene information acquired by the type-specific 3D scene information acquisition unit 80 may differ in the time required to complete an update, depending on the density of the information and the size of the range they represent. The type-specific 3D scene information acquisition unit 80 therefore records, in association with each of the multiple types of 3D scene information stored in the 3D scene information storage unit 84, the time of the scene it reflects. Note that the type-specific 3D scene information acquisition unit 80 may acquire multiple types of 3D scene information with different combinations of information density and learning-target range.

 The 3D scene information transmission unit 86 transmits the latest 3D scene information data stored in the 3D scene information storage unit 84 to the client terminal 10. For a given scene, the 3D scene information transmission unit 86 transmits the pieces of 3D scene information to the client terminal 10 in the order in which their learning is completed. For example, when data of 3D scene information with different information densities is to be transmitted, learning is completed in order of increasing information density, as described above. The 3D scene information transmission unit 86 therefore transmits the data of the lowest-density 3D scene information as soon as its learning is completed, then the data of the next lowest density as soon as its learning is completed, and so on, progressively transmitting up to the 3D scene information with the highest information density.

 The 3D scene information transmission unit 86 internally includes a division unit 88. The division unit 88 divides each of the multiple types of 3D scene information representing one scene into multiple pieces of data. The 3D scene information representing one scene is constituted by a separate neural network for each type, and the division unit 88 divides each neural network into multiple neural networks. Here, division means removing mutually different subsets of nodes while maintaining the structure of the nodes associated via hash tables or the like. The nodes to be removed from each divided neural network are determined randomly.
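 The division just described can be sketched as follows, treating a network simply as a mapping from node identifiers to parameters. This is a hypothetical illustration: the function names, the dictionary representation, and the seeded random generator are assumptions, not part of the embodiment.

```python
import random

def split_network(nodes, num_parts, seed=None):
    """Divide a network (node_id -> parameters) into num_parts copies, each
    with a mutually different, randomly chosen subset of nodes removed.

    Each node is dropped from exactly one part and kept in all the others,
    so merging every part restores the original network exactly.
    """
    rng = random.Random(seed)
    drop_group = {nid: rng.randrange(num_parts) for nid in nodes}
    return [
        {nid: params for nid, params in nodes.items() if drop_group[nid] != j}
        for j in range(num_parts)
    ]

def merge_parts(parts):
    """Client-side reconstruction: take the union of whichever parts arrived."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged
```

If one part is lost to packet loss, merging the remaining parts still yields a network missing only the nodes unique to the lost part, which, as with dropout, can still be used for drawing.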

 A technique called dropout is known in which some of the nodes of a neural network are randomly deactivated to mitigate overfitting (see, for example, Nitish Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, June 2014, Vol. 15, pp. 1929-1958). As dropout makes clear, a neural network maintains the accuracy of its learning results above a certain level even when some of its nodes are deactivated.

 Exploiting this property, the client terminal 10 can generate a display image with relatively high accuracy even using 3D scene information from which some nodes have been removed. By having the content server 20 transmit multiple neural networks representing the same 3D scene information, each with some nodes removed, robustness against packet loss can be increased while suppressing growth in data size. If there is no packet loss, the image generation unit 56 of the client terminal 10 can restore the pre-division neural network and draw the display image. If there is packet loss, the image generation unit 56 draws the display image using whichever neural networks it was able to acquire; as noted above, generating the display image remains possible in this case as well.

 Also, as noted above, multiple types of 3D scene information representing a single scene may arrive from the content server 20 at different times. In this case, the image generation unit 56 of the client terminal 10 updates the display image using the newly acquired 3D scene information. Since the resolution of the display image corresponds to the information density of the 3D scene information, acquiring the 3D scene information in order of increasing information density causes the resolution of the display image to rise gradually. Moreover, 3D scene information with low information density requires fewer samples in volume rendering, so the display image is drawn faster.

 As a result, by first acquiring low-information-density 3D scene information, whose learning time on the content server 20 is short, and quickly drawing a low-resolution image, the time from the start of learning each scene to its display can be shortened dramatically. Ultimately, a high-definition image can be displayed using high-information-density 3D scene information, so low-latency display is achieved while minimizing the impact on appearance. As noted above, the types of 3D scene information are not limited to distinctions in resolution. For example, the content server 20 may learn only the partial region the user is gazing at in a short time and transmit it first, transmitting the 3D scene information for the whole scene later. In this case too, by the same principle, an image representation becomes possible in which the gazed-at region is displayed with low latency and the surrounding image is updated afterward.

 The 3D scene information transmission unit 86 may transmit the 3D scene information representing one scene to the client terminal 10 using different communication protocols depending on its type. For example, the 3D scene information transmission unit 86 may transmit the 3D scene information for regions where image detail should take priority using the highly reliable TCP/IP (Transmission Control Protocol/Internet Protocol), and transmit the 3D scene information for regions where low-latency motion should take priority using UDP (User Datagram Protocol), which has a higher transfer rate. The correspondence between types of 3D scene information and the communication protocols suited to them is set in advance in, for example, the internal memory of the 3D scene information transmission unit 86.

 FIG. 5 schematically shows the procedure by which the content server 20 acquires 3D scene information. First, the display world control unit 74 constructs a display world 210 in which, for example, an enemy character 212 exists. As noted above, the display world 210 may contain motion, but here the display world 210 corresponding to a single scene is shown. The learning viewpoint generation unit 72 then generates multiple learning viewpoints based on the latest display viewpoint and the like, and the learning image generation unit 78 generates learning images 214a, 214b, 214c, and so on, corresponding to the respective learning viewpoints. The number of learning images generated is not limited.

 The type-specific 3D scene information acquisition unit 80 performs machine learning using the learning images 214a, 214b, 214c, and so on, and updates multiple types of 3D scene information. The substance of each piece of 3D scene information is a neural network 216a, 216b, 216c, .... The neural networks 216a, 216b, 216c, ... differ in at least one of information density and represented range. When varying the information density, the type-specific 3D scene information acquisition unit 80 performs learning by, for example, setting a different value of the above-mentioned parameter L for each. In this case, the lower the information density of a piece of 3D scene information, the sooner its learning is completed.

 When varying the represented range, the type-specific 3D scene information acquisition unit 80 performs learning by, for example, cutting out the corresponding regions from the learning images 214a, 214b, 214c. In this case, the narrower the range represented by a piece of 3D scene information, the sooner its learning is completed. When information density and represented range are varied in combination, the time to complete learning varies with their balance. Qualitatively, the higher the information density and the wider the represented range, the later learning is completed, so the balance between information density and breadth of range is optimized in advance to obtain an appropriate delay time. By repeating the illustrated processing at a predetermined rate, the neural networks 216a, 216b, 216c, ... are each updated in accordance with the transitions of the scene.

 FIG. 6 is a diagram for explaining a mode of acquiring 3D scene information of different ranges. Suppose that part of the scene of the display world 210 shown in FIG. 5 is represented as a display image 222. The type-specific 3D scene information acquisition unit 80, for example, separately acquires 3D scene information representing a range 226 in the display world 210 that corresponds to an important region 224 in the display image 222, and 3D scene information representing the whole scene, including what lies outside that range. Here, the important region 224 is, for example, a region within a predetermined range of the user's gaze point, a region within a predetermined range of the center of the display image, a region in which the battle situation or acquired items are shown, or a region in which main objects such as enemy characters or user characters exist; the rules for selecting the region are set in advance according to the nature of the content and so on.

 Alternatively, the type-specific 3D scene information acquisition unit 80 may directly identify a main object itself that exists in the display world 210, or a range of a predetermined size containing that object, and acquire 3D scene information with it as the learning target. When the range to be learned is determined based on the user's gaze point, a well-known gaze point detector is provided in the client terminal 10. The input information acquisition unit 70 of the content server 20 then acquires gaze point information from the client terminal 10 at a predetermined rate and determines the range to be learned individually.

 The type-specific 3D scene information acquisition unit 80 may also separately learn the range of the display world corresponding to the display image 222 being displayed on the client terminal 10 and a wider range including what lies outside it. In any case, the type-specific 3D scene information acquisition unit 80 performs machine learning by cutting out the corresponding partial regions from the multiple learning images generated by the learning image generation unit 78, while also performing machine learning over the wider range by, for example, using the learning images in their entirety.

 The variations of range represented by the 3D scene information are not limited to two and may be three or more. The inclusion relationships among the ranges are also not limited; 3D scene information may be acquired for each of multiple independent ranges. In any case, as many pieces of 3D scene information are acquired as there are ranges set. The narrower the range, the shorter the learning time and drawing time, enabling display with lower latency. Exploiting this property, even if the information density of the 3D scene information corresponding to the important region 224 is raised to some extent, drawing can be completed, provided the range is narrow, in a time comparable to drawing a wide-area image using low-information-density 3D scene information. As a result, it also becomes possible to display the important region 224 with high definition and low latency.

 FIG. 7 schematically shows how the data changes as 3D scene information is transmitted and received. Time passes from the top of the figure toward the bottom; (a) through (c) show the state transitions of the data in the content server 20, and (c) through (e) show the state transitions of the data in the client terminal 10. The figure concerns the 3D scene information representing a single scene. First, as shown in (a), the type-specific 3D scene information acquisition unit 80 of the content server 20 acquires, based on newly generated learning images, multiple neural networks 230a, 230b, ... corresponding respectively to multiple types of 3D scene information.

 Next, as shown in (b), the 3D scene information transmission unit 86 of the content server 20 divides each of the neural networks 230a and 230b. That is, the 3D scene information transmission unit 86 generates, from the neural network 230a, multiple neural networks 232a and 232b from each of which a mutually different subset of nodes has been removed. Likewise, the 3D scene information transmission unit 86 generates, from the neural network 230b, multiple neural networks 234a and 234b from each of which a mutually different subset of nodes has been removed.

 In the figure, among the divided neural networks 232a, 232b, 234a, and 234b, the nodes removed from the original neural networks 230a and 230b are shown with dotted lines. If, in the neural networks 232a and 232b obtained by dividing a given neural network 230a, the nodes removed from one neural network 232a are retained in the other neural network 232b, then combining the two completely restores the original neural network 230a.

 However, as noted above, even if some nodes are missing due to packet loss or the like, an image can be generated from the remaining neural networks. The number of divisions of a neural network is not limited to two. The division unit 88 of the 3D scene information transmission unit 86 preferably determines the nodes to be removed randomly, in the same manner as dropout. The 3D scene information transmission unit 86 packetizes the divided neural networks 232a, 232b, 234a, and 234b and transmits them to the client terminal 10.

 At this time, as shown in (c), the 3D scene information transmission unit 86 randomly interleaves the transmission order of the packets of the multiple neural networks. In the figure, transmission in the order of neural networks 232b, 234b, 232a, ... is shown as a horizontal arrangement. Interleaving the transmission order reduces the possibility that all 3D scene information of a given type is lost to a run of consecutive packet losses. The interleaving is performed, however, only among neural networks updated in parallel within a permitted predetermined time, so that it does not delay the drawing processing more than necessary.
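 The random interleaving of packets from networks updated within the same window can be sketched as follows; the function name and the flat (network id, packet) representation are assumptions for illustration.

```python
import random

def interleave_packets(packet_groups, seed=None):
    """Randomly interleave packets from several networks updated in parallel.

    packet_groups is a list of packet lists, one per divided neural network.
    Shuffling the combined order means a burst of consecutive packet losses
    is unlikely to remove every packet of one type of 3D scene information.
    """
    rng = random.Random(seed)
    flat = [(net_id, pkt)
            for net_id, group in enumerate(packet_groups)
            for pkt in group]
    rng.shuffle(flat)
    return flat
```

Tagging each packet with its network id plays the role of the metadata described below it: the receiver can regroup packets by id regardless of arrival order.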

 When the 3D scene information acquisition unit 52 of the client terminal 10 acquires the packets in sequence, it reconstructs the pre-division neural networks by restoring the extracted neural networks to their original order, as shown in (d). To this end, the content server 20 attaches to each of the transmitted neural networks 232b, 234b, 232a, ... metadata indicating which of the original neural networks 230a, 230b, ... it was divided from.

 In the illustrated example, the 3D scene information acquisition unit 52 was able to acquire the neural networks 232a and 232b, and can therefore completely restore the original neural network 230a. For the original neural network 230b, on the other hand, one of the divided neural networks, 234a, could not be acquired due to packet loss, as indicated by the dotted frame 236; in this case, the original neural network 230b cannot be completely restored. In either case, the image generation unit 56 of the client terminal 10 uses the neural networks it was able to acquire, 232a, 232b, and 234b, to draw a display image 238 as shown in (e).

 Specifically, the image generation unit 56 draws the image of the scene viewed from the latest display viewpoint by volume rendering using the neural networks 232a, 232b, and 234b. When multiple types of 3D scene information are transmitted at different times, the image generation unit 56 updates at least part of the display image 238 using the latest 3D scene information. This realizes a display in which, for example, the resolution of the image rises gradually, or important parts of the image move with particularly low latency.

 FIG. 8 is a diagram for explaining the temporal relationship between machine learning in the content server 20 and image drawing in the client terminal 10. Here, as an example, assume a case in which multiple types of 3D scene information with different information densities are transmitted. The horizontal direction of the figure is the time axis; the learning times in the content server 20 and the drawing times of the individual frames of the display image in the client terminal 10 are each shown as rectangles. The number attached to each drawing-time rectangle indicates the order of the frame.

 As shown in the upper part, the type-specific 3D scene information acquisition unit 80 of the content server 20 learns the first, second, ..., and n-th types of 3D scene information individually. Here, the ordinal corresponds to the information density: the first type has the lowest information density and the n-th type the highest. As illustrated, even if the type-specific 3D scene information acquisition unit 80 starts learning the multiple types of 3D scene information simultaneously, the higher the information density, the longer learning takes to complete. The content server 20 first transmits the first type of 3D scene information to the client terminal 10 as soon as its learning is completed, as indicated by arrow A1.

 クライアント端末10は、第1種の3Dシーン情報を用いて、時刻t1から、番号0、1、2のフレームの画像を最低解像度で描画していく。一方、コンテンツサーバ20は、第2種の3Dシーン情報の学習が完了した時点で、矢印A2に示すように、クライアント端末10にそれを送信する。クライアント端末10は、第2種の3Dシーン情報を取得した時刻t2の直後に描画を開始する、番号3のフレームから、第2種の情報密度に対応する解像度で画像を描画する。同様の処理を繰り返すことで、フレームの解像度が徐々に増加していく。 Using the first type of 3D scene information, the client terminal 10 draws the images of frames 0, 1, and 2 at the lowest resolution, starting at time t1. Meanwhile, when the content server 20 has completed learning the second type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow A2. Starting with frame number 3, whose drawing begins immediately after time t2, when the second type of 3D scene information is acquired, the client terminal 10 draws images at the resolution corresponding to the second type's information density. By repeating this process, the frame resolution rises step by step.

 コンテンツサーバ20が第n種の3Dシーン情報の学習を完了したら、矢印Anに示すようにクライアント端末10に送信する。クライアント端末10は、当該3Dシーン情報を取得した時刻tnの直後に描画を開始する、番号m+2のフレームから、第n種の情報密度に対応する最高解像度で画像を描画する。本実施の形態ではこのように、同じシーンを表す3Dシーン情報を複数種類学習し、早くに学習が完了する3Dシーン情報を即時にクライアント端末10に送信する。これにより、例えば第n種など情報密度の高い3Dシーン情報のみを送信する場合と比較し、フレームの描画開始時刻を格段に早めることができる。 Once the content server 20 has completed learning the nth type of 3D scene information, it transmits it to the client terminal 10, as indicated by arrow An. Starting with frame number m+2, whose drawing begins immediately after time tn, when that 3D scene information is acquired, the client terminal 10 draws images at the highest resolution, corresponding to the nth type's information density. In this embodiment, multiple types of 3D scene information representing the same scene are thus learned, and whichever type finishes learning earliest is transmitted to the client terminal 10 immediately. This makes the frame drawing start time far earlier than when only high-information-density 3D scene information, such as the nth type, is transmitted.
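The timing relationship of Fig. 8 can be paraphrased numerically. In the following hypothetical sketch, the time values, the fixed frame period, and the function names are all illustrative assumptions: each density type is transmitted the moment its learning completes, and the first frame that can use it is the first one whose drawing starts after arrival.

```python
import math

def transmission_order(learning_times):
    """Types are sent in ascending order of learning-completion time,
    i.e. lowest information density first."""
    return [k for _, k in sorted((t, k) for k, t in learning_times.items())]

def first_usable_frame(learning_time, transfer_time, frame_period):
    """Index of the first frame whose drawing starts at or after the
    moment the 3D scene information arrives at the client terminal."""
    arrival = learning_time + transfer_time  # e.g. Tb + Ta for the 1st type
    return math.ceil(arrival / frame_period)

# Three density types start learning simultaneously at t = 0 (times in ms).
times = {1: 30.0, 2: 90.0, 3: 300.0}
order = transmission_order(times)              # lowest density is sent first
f1 = first_usable_frame(times[1], 10.0, 16.7)  # first frame drawn with type 1
f3 = first_usable_frame(times[3], 10.0, 16.7)  # first frame drawn with type 3
```

With these assumed numbers, type 1 becomes usable many frames before type 3, which is exactly the head start that sending the low-density representation early provides.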

 すなわちコンテンツサーバ20において学習が開始される時刻を基準とすると、データ伝送のための時間Taと、第1種の3Dシーン情報の学習に要する時間Tbのみの遅延で描画を開始できる。なお実際には、時刻t1の前には、直前に送信された前のシーンの3Dシーン情報を用いたフレームの描画が行われてよい。また画像生成部56は、各フレームの描画時、直前に取得した最新の表示用視点に基づきボリュームレンダリングを行う。これにより、フレーム番号0からm+2にかけて、単に解像度が増加するのみならず、視点の動きも低遅延で反映させた画像を表示できる。ここで第1種など3Dシーン情報の情報密度が低いほど、ボリュームレンダリングにおけるサンプリング数が少なく描画速度が高くなるため、より少ない遅延での表示が可能になる。 In other words, taking the time at which learning starts in the content server 20 as a reference, drawing can begin after a delay of only the data transmission time Ta plus the time Tb required to learn the first type of 3D scene information. In practice, before time t1, frames may be drawn using the 3D scene information of the previous scene, transmitted just beforehand. When drawing each frame, the image generation unit 56 performs volume rendering based on the latest display viewpoint, acquired immediately before. As a result, over frames 0 through m+2, the displayed images not only rise in resolution but also reflect viewpoint movement with low latency. The lower the information density of the 3D scene information, as with the first type, the fewer samples volume rendering requires and the faster the drawing, so display with even less delay becomes possible.

 このように画像生成部56は、表示済みか否かに関わらず、一旦、生成された画像を、最新の表示用視点に合うように補正する機能も有しているといえる。図9は、クライアント端末10の画像生成部56による画像補正処理を説明するための図である。画像250aは表示画像のフレームまたはその一部であり、前景として円筒形のオブジェクトの像252aが表されている。視点の動きにより相対的に円筒形のオブジェクトの位置が変化し、Δtの時間経過後の画像250bにおいてオブジェクトの像252bがずれると、画像250aでは隠蔽されていた背景の領域254を新たに描画する必要が生じる。 In this way, image generation unit 56 can be said to have the function of correcting an image that has already been generated so that it matches the latest display viewpoint, regardless of whether it has already been displayed. Figure 9 is a diagram for explaining the image correction process by image generation unit 56 of client terminal 10. Image 250a is a frame or part of a display image, and shows image 252a of a cylindrical object in the foreground. If the position of the cylindrical object changes relatively due to movement of the viewpoint, and object image 252b shifts in image 250b after the lapse of time Δt, it becomes necessary to newly draw background area 254 that was hidden in image 250a.

 3Dシーン情報を用いず、コンテンツサーバ20から送信された画像を表示する従来技術では、元の画像250aでは表されていない領域254を新たに作り出すことが難しい。オブジェクトとその周囲の領域の情報を含む3Dシーン情報を用いれば、最新の視点に対応するように、画像250bの全ての画素を決定できるため、当然、領域254の描画も可能になる。また表示用視点が動き、オブジェクトおよび光源との位置関係が変化すれば、オブジェクトの像252bの色味も変化し得る。これについても3Dシーン情報を用いて、新たな表示用視点から見た像を描画することにより、色味の変化を正確に表現できる。これにより、Δtの時間経過後の画像250cを精度よく生成できる。 In conventional technology that displays images sent from the content server 20 without using 3D scene information, it is difficult to create a new area 254 that is not shown in the original image 250a. If 3D scene information containing information on the object and its surrounding area is used, all pixels of image 250b can be determined to correspond to the latest viewpoint, so it is naturally possible to draw area 254. Furthermore, if the display viewpoint moves and the positional relationship with the object and light source changes, the color of the image 252b of the object may also change. In this case, too, the change in color can be accurately expressed by using the 3D scene information to draw an image seen from a new display viewpoint. This makes it possible to generate image 250c after a time Δt has passed with high accuracy.
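A minimal sketch of the correction illustrated in Fig. 9 follows. The one-dimensional parallax shift, the pixel values, and the `redraw_pixel` stub are all illustrative assumptions: the previous frame is warped by the viewpoint motion, and the disoccluded pixels for which no source pixel exists, like region 254, are filled by re-rendering from the 3D scene information.

```python
def correct_frame(old_frame, shift, redraw_pixel):
    """Warp old_frame horizontally by `shift` pixels, then fill the
    exposed (disoccluded) pixels via redraw_pixel(x, y), which stands
    in for re-rendering from the latest 3D scene information."""
    h, w = len(old_frame), len(old_frame[0])
    HOLE = object()  # marker for pixels with no warped source
    new_frame = [[HOLE] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            nx = x + shift
            if 0 <= nx < w:
                new_frame[y][nx] = old_frame[y][x]
    for y in range(h):
        for x in range(w):
            if new_frame[y][x] is HOLE:
                # Region exposed by the viewpoint motion: draw it anew.
                new_frame[y][x] = redraw_pixel(x, y)
    return new_frame

# A 1x4 frame; shifting right by one pixel exposes the leftmost column.
frame = correct_frame([["a", "b", "c", "d"]], 1, lambda x, y: "bg")
```

In a real renderer the warp would be a full 3D reprojection and `redraw_pixel` would be volume rendering from the latest viewpoint, which also captures the color changes described above; the hole-filling structure, however, stays the same.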

 なおこれまで述べたように、クライアント端末10が3Dシーン情報を用いて、表示画像の各フレームを描画することを前提とすると、画像250a、250cはそれぞれ、その時点での最新の3Dシーン情報を用いてクライアント端末10が描画した画像である。一方、コンテンツサーバ20から、3Dシーン情報とともに表示画像も送信する態様を想定し、コンテンツサーバ20から送信された画像250aを、クライアント端末10が3Dシーン情報を用いて補正することも可能である。 As described above, assuming that the client terminal 10 uses 3D scene information to draw each frame of the display image, images 250a and 250c are images that the client terminal 10 has drawn using the latest 3D scene information at that time. On the other hand, assuming a situation in which the content server 20 transmits the display image together with the 3D scene information, it is also possible for the client terminal 10 to correct image 250a transmitted from the content server 20 using the 3D scene information.

 例えばクライアント端末10は、コンテンツサーバ20側で画像250aを生成した時点から、それをクライアント端末10で表示するまでの時間差Δtの間に生じた視点の動きを反映するように補正した画像250cを生成する。この場合もクライアント端末10の画像生成部56は、コンテンツサーバ20が生成した画像250aでは表されていない領域254や、色味が変化した像252bを、最新の3Dシーン情報を用いて描画し直す。 For example, the client terminal 10 generates an image 250c that has been corrected to reflect the movement of the viewpoint that occurs during the time difference Δt between when the image 250a is generated on the content server 20 side and when it is displayed on the client terminal 10. In this case as well, the image generating unit 56 of the client terminal 10 redraws the area 254 that is not shown in the image 250a generated by the content server 20 and the image 252b whose color has changed, using the latest 3D scene information.

 この場合、コンテンツサーバ20において生成された高品質な画像250aと、クライアント端末10で新たに描画した領域とを合成することになるため、画像生成部56は、情報密度の高い3Dシーン情報を用いて必要な領域を描画することが望ましい。なお画像生成部56は、コンテンツサーバ20から送信された画像250aにおける像を移動させたり変形させたりするのみでも違和感が少ない状況においては、3Dシーン情報を用いた新たな描画処理を省略できる。例えば表示用視点の変化に対し像のずれ量や色味の変化が小さい、遠方にあるオブジェクトについては、画像生成部56は、画像250aに直接加工を施すなどして補正してもよい。 In this case, since the high-quality image 250a generated by the content server 20 is to be combined with the area newly drawn by the client terminal 10, it is desirable for the image generation unit 56 to draw the necessary area using 3D scene information with high information density. Note that the image generation unit 56 can omit new drawing processing using 3D scene information in a situation where there is little sense of incongruity simply by moving or deforming the image in the image 250a transmitted from the content server 20. For example, for distant objects where the amount of image shift or color change is small in response to changes in the display viewpoint, the image generation unit 56 may perform correction by directly processing the image 250a.

 そのため画像生成部56にはあらかじめ、3Dシーン情報を用いて描画し直す必要がある領域を判定する規則を、内部のメモリなどに格納しておく。例えば画像生成部56は、表示用視点の速度がしきい値を超えたとき、または超えると予測されるとき、視点からの距離がしきい値以下のオブジェクトとその周囲の領域について、3Dシーン情報を用いて描画し直してもよい。この場合、コンテンツサーバ20は、表示世界のジオメトリ情報もクライアント端末10に送信する。これにより画像生成部56は、表示用視点とオブジェクトとの距離の変化を取得できる。 For this reason, the image generation unit 56 stores in advance in an internal memory or the like rules for determining areas that need to be redrawn using 3D scene information. For example, when the speed of the display viewpoint exceeds a threshold value, or is predicted to exceed it, the image generation unit 56 may use 3D scene information to redraw objects whose distance from the viewpoint is less than or equal to the threshold value, and their surrounding areas. In this case, the content server 20 also transmits geometry information of the display world to the client terminal 10. This allows the image generation unit 56 to obtain changes in the distance between the display viewpoint and the object.
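The stored rule could be as simple as the following sketch. The threshold values and the function name are illustrative assumptions; the text above only requires that some such rule be held in internal memory.

```python
def needs_redraw(viewpoint_speed, object_distance,
                 speed_threshold=0.5, distance_threshold=10.0):
    """Re-render an object and its surroundings from the 3D scene
    information only when the display viewpoint moves (or is predicted
    to move) faster than the threshold AND the object lies within the
    distance threshold; otherwise a 2D warp of the server image is
    assumed to look acceptable. Thresholds are illustrative."""
    return (viewpoint_speed > speed_threshold
            and object_distance <= distance_threshold)
```

Distant objects, whose image shift and color change are small, fail the distance test and are handled by directly processing the transmitted image, as described above; the geometry information sent by the server supplies the `object_distance` input.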

 以上述べた本実施の形態によれば、コンテンツサーバ20は、コンテンツの表示世界を表す学習用画像を生成し、各シーンの3Dシーン情報を複数種類、取得する。コンテンツサーバ20は学習が完了した3Dシーン情報を順次、クライアント端末10に送信し、クライアント端末10はそれを用いて、最新の表示用視点に対応する表示画像のフレームを描画する。3Dシーン情報の種類として、情報の密度や表す範囲の広さを様々に設定することにより、学習時間が変化し、ひいては表示までの遅延時間も制御できる。さらに部分的に詳細度を高めることもできるため、画像上の重要性などを加味した、低遅延で高品質な画像を表示できる。 According to the present embodiment described above, the content server 20 generates learning images that represent the display world of the content, and acquires multiple types of 3D scene information for each scene. The content server 20 sequentially transmits the 3D scene information for which learning has been completed to the client terminal 10, and the client terminal 10 uses it to draw frames of the display image corresponding to the latest display viewpoint. By setting various types of 3D scene information, such as the density of information and the width of the range to be represented, the learning time can be changed, and the delay time until display can also be controlled. Furthermore, the level of detail can be increased in some areas, making it possible to display high-quality images with low latency that take into account the importance of the image, etc.

 またコンテンツサーバ20は、1つの3Dシーン情報を構成するニューラルネットワークを、複数のニューラルネットワークに分割してクライアント端末10に送信する。これにより、データサイズの増大を抑えつつ、パケットロスに対する頑健性を高められる。以上のことから、コンテンツサーバ20からの配信を伴う画像処理において、配信が介在することによる影響を軽減でき、ユーザ体験の質を高めることができる。 The content server 20 also divides the neural network that constitutes one piece of 3D scene information into multiple neural networks and transmits them to the client terminal 10. This makes it possible to improve robustness against packet loss while suppressing increases in data size. As a result, in image processing involving distribution from the content server 20, the impact of distribution can be reduced, improving the quality of the user experience.
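The division can be sketched as follows. The modulo partition and the dictionary representation of per-node weights are assumptions for illustration: each transmitted network omits a different subset of nodes, so with three or more parts every node survives the loss of any single part, and the client reconstructs the original by merging whatever arrives.

```python
def split_network(node_weights, n_parts):
    """Produce n_parts copies of one network's node weights, part i
    omitting the nodes whose index is congruent to i mod n_parts.
    Each node therefore appears in n_parts - 1 of the parts."""
    return [{node: w for node, w in node_weights.items()
             if node % n_parts != i}
            for i in range(n_parts)]

def reconstruct(parts):
    """Merge the surviving parts; any copy of a node restores it."""
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

weights = {i: i / 10 for i in range(6)}   # toy per-node weights
parts = split_network(weights, 3)
recovered = reconstruct(parts[1:])        # part 0 lost in transit
```

Because the excluded subsets differ between parts, the data overhead is the chosen redundancy factor rather than a full duplicate per packet, which is the trade-off between data size and packet-loss robustness described above.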

 以上、本発明を実施の形態をもとに説明した。実施の形態は例示であり、それらの各構成要素や各処理プロセスの組合せにいろいろな変形例が可能なこと、またそうした変形例も本発明の範囲にあることは当業者に理解されるところである。 The present invention has been described above based on an embodiment. The embodiment is merely an example, and it will be understood by those skilled in the art that various modifications are possible in the combination of each component and each processing process, and that such modifications are also within the scope of the present invention.

 以上のように本発明は、コンテンツサーバ、ゲーム装置、ヘッドマウントディスプレイ、表示装置、携帯端末、パーソナルコンピュータなど各種情報処理装置や、それらのいずれかを含む画像表示システムなどに利用可能である。 As described above, the present invention can be used in various information processing devices such as content servers, game devices, head-mounted displays, display devices, mobile terminals, and personal computers, as well as image display systems that include any of these.

 本開示は、以下の態様を含んでよい。
[項目1]
 コンテンツサーバであって、以下のように構成された回路(circuitry configured to)を備え、
 前記回路(circuitry)は、
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する、
 コンテンツサーバ。
[項目2]
 前記回路(circuitry)は、
 空間的な情報の密度が異なる、複数の前記3Dシーン情報を取得する、項目1に記載のコンテンツサーバ。
[項目3]
 前記回路(circuitry)は、
 1つの前記3Dシーン情報を構成するニューラルネットワークを、互いに異なる一部のノードが除外された複数のニューラルネットワークに分割したうえ、前記クライアント端末に送信する、項目1に記載のコンテンツサーバ。
[項目4]
 前記回路(circuitry)は、分割後の前記ニューラルネットワークをパケット化するとともに、その送信順をランダムに入れ替える、項目3に記載のコンテンツサーバ。
[項目5]
 前記回路(circuitry)は、前記表示世界における異なる範囲を表す、複数の前記3Dシーン情報を取得する、項目1に記載のコンテンツサーバ。
[項目6]
 前記回路(circuitry)は、前記クライアント端末において表示されている画像において定めた領域に対応する、前記表示世界の範囲を表す前記3Dシーン情報を取得する、項目5に記載のコンテンツサーバ。
[項目7]
 前記回路(circuitry)は、前記表示世界に存在するオブジェクトを対象とする3Dシーン情報を取得する、項目5に記載のコンテンツサーバ。
[項目8]
 前記回路(circuitry)は、前記複数種類の3Dシーン情報のデータを、異なる通信プロトコルで前記クライアント端末に送信する、項目1に記載のコンテンツサーバ。
[項目9]
 クライアント端末であって、以下のように構成された回路(circuitry configured to)を備え、
 前記回路(circuitry)は、
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得し、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 クライアント端末。
[項目10]
 前記回路(circuitry)は、1つの前記3Dシーン情報を構成するニューラルネットワークを分割してなる、互いに異なる一部のノードが除外された複数のニューラルネットワークを取得し、分割前のニューラルネットワークを再構成する、項目9に記載のクライアント端末。
[項目11]
 前記回路(circuitry)は、空間的な情報の密度が異なる複数の前記3Dシーン情報を、当該情報の密度が低い順に取得し、
 取得された前記3Dシーン情報の前記情報の密度に対応するように、前記フレームの解像度を変化させる、項目9に記載のクライアント端末。
[項目12]
 前記回路(circuitry)は、前記サーバから送信された表示画像のうち、前記表示用視点の変化により描画が必要と判定された領域を、前記3Dシーン情報を用いて描画する、項目9に記載のクライアント端末。
[項目13]
 ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、前記クライアント端末と前記コンテンツサーバは、以下のように構成された回路(circuitry configured to)を備え、
 前記コンテンツサーバが備える回路(circuitry)は、
 前記表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記クライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信し、
 前記クライアント端末が備える回路(circuitry)は、
 前記ユーザ操作の情報と、前記表示世界に対する表示用視点の情報とを取得し、
 前記複数種類の3Dシーン情報のデータを、前記コンテンツサーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 画像表示システム。
[項目14]
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成し、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得し、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する、
 表示用データ送信方法。
[項目15]
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得し、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得し、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する、
 表示画像生成方法。
[項目16]
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する機能と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する機能と、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する機能と、
 をコンピュータに実現させるためのプログラムを記録した記録媒体。
[項目17]
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する機能と、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する機能と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する機能と、
 をコンピュータに実現させるためのプログラムを記録した記録媒体。
The present disclosure may include the following aspects.
[Item 1]
A content server, comprising:
The circuitry includes:
Generate learning images that represent scenes from multiple viewpoints of a three-dimensional display world in which the situation changes in response to user operations;
acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
Transmitting the plurality of types of 3D scene information data to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
Content server.
[Item 2]
The circuitry includes:
The content server according to item 1, wherein a plurality of pieces of 3D scene information are acquired, each having a different spatial information density.
[Item 3]
The circuitry includes:
A content server according to item 1, which divides a neural network constituting one of the 3D scene information into multiple neural networks from which some nodes that are different from each other are excluded, and transmits the multiple neural networks to the client terminal.
[Item 4]
The content server according to item 3, wherein the circuitry packetizes the neural network after division and randomly changes the transmission order of the packets.
[Item 5]
The content server of item 1, wherein the circuitry obtains a plurality of pieces of 3D scene information representing different areas in the display world.
[Item 6]
The content server of item 5, wherein the circuitry obtains the 3D scene information representing a range of the display world that corresponds to a defined area in an image displayed on the client terminal.
[Item 7]
The content server of item 5, wherein the circuitry obtains 3D scene information for objects present in the display world.
[Item 8]
The content server according to item 1, wherein the circuitry transmits data of the plurality of types of 3D scene information to the client terminal using different communication protocols.
[Item 9]
A client terminal, comprising:
The circuitry includes:
acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
Acquire from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Client terminal.
[Item 10]
The client terminal according to item 9, wherein the circuitry obtains a plurality of neural networks obtained by dividing a neural network that constitutes one of the 3D scene information, with some nodes that are different from each other being excluded, and reconstructs the neural network before division.
[Item 11]
The circuitry acquires a plurality of pieces of 3D scene information having different spatial information densities in ascending order of spatial information density;
The client terminal according to item 9, which varies the resolution of the frames to correspond to the information density of the acquired 3D scene information.
[Item 12]
The client terminal according to item 9, wherein the circuitry uses the 3D scene information to draw an area of the display image transmitted from the server that is determined to require drawing due to a change in the display viewpoint.
[Item 13]
A client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation, and a content server that transmits data used to generate the display image, the client terminal and the content server each comprising circuitry configured as follows:
The circuitry of the content server includes:
generating images representing each scene of the display world as seen from a plurality of viewpoints as learning images;
By machine learning using the learning images as training data, a plurality of types of 3D scene information representing three-dimensional information of each scene is obtained;
Transmitting the plurality of types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed;
The circuitry of the client terminal includes:
acquiring information on the user operation and information on a display viewpoint for the display world;
acquiring the plurality of types of 3D scene information data from the content server;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Image display system.
[Item 14]
Generate learning images that represent scenes from multiple viewpoints of a three-dimensional display world in which the situation changes in response to user operations;
By machine learning using the learning images as training data, a plurality of types of 3D scene information representing three-dimensional information of each scene is obtained;
Transmitting the plurality of types of 3D scene information data to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A method for transmitting data for display.
[Item 15]
Acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
Acquire from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint.
Display image generation method.
[Item 16]
A function of generating images for learning that represent each scene in a three-dimensional display world, the situation of which changes in response to user operations, as seen from multiple viewpoints;
A function of acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
A function of transmitting the data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A recording medium on which a program for realizing the above on a computer is recorded.
[Item 17]
A function for acquiring information on user operations and information on a display viewpoint for a three-dimensional display world;
A function of acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A recording medium on which a program for implementing the above on a computer is recorded.

 1 画像表示システム、 10 クライアント端末、 14 入力装置、 16 表示装置、 20 コンテンツサーバ、 50 入力情報取得部、 52 3Dシーン情報取得部、 54 3Dシーン情報記憶部、 56 画像生成部、 58 出力部、 70 入力情報取得部、 72 学習用視点生成部、 74 表示世界制御部、 76 3次元モデル記憶部、 78 学習用画像生成部、 80 種類別3Dシーン情報取得部、 84 種類別3Dシーン情報記憶部、 86 3Dシーン情報送信部、 88 分割部、 122 CPU、 124 GPU、 126 メインメモリ。 1 Image display system, 10 Client terminal, 14 Input device, 16 Display device, 20 Content server, 50 Input information acquisition unit, 52 3D scene information acquisition unit, 54 3D scene information storage unit, 56 Image generation unit, 58 Output unit, 70 Input information acquisition unit, 72 Learning viewpoint generation unit, 74 Display world control unit, 76 3D model storage unit, 78 Learning image generation unit, 80 Type-specific 3D scene information acquisition unit, 84 Type-specific 3D scene information storage unit, 86 3D scene information transmission unit, 88 Splitting unit, 122 CPU, 124 GPU, 126 Main memory.

Claims (17)

 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、
 を備えたことを特徴とするコンテンツサーバ。
a learning image generating unit that generates, as learning images, images that represent how each scene in a three-dimensional displayed world, whose situation changes in response to a user operation, is viewed from a plurality of viewpoints;
a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as teacher data;
a 3D scene information transmission unit that transmits data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A content server comprising:
 前記種類別3Dシーン情報取得部は、空間的な情報の密度が異なる、複数の前記3Dシーン情報を取得することを特徴とする請求項1に記載のコンテンツサーバ。 The content server according to claim 1, characterized in that the type-specific 3D scene information acquisition unit acquires multiple pieces of 3D scene information with different spatial information densities.

 前記3Dシーン情報送信部は、1つの前記3Dシーン情報を構成するニューラルネットワークを、互いに異なる一部のノードが除外された複数のニューラルネットワークに分割したうえ、前記クライアント端末に送信することを特徴とする請求項1または2に記載のコンテンツサーバ。 The content server according to claim 1 or 2, characterized in that the 3D scene information transmission unit divides the neural network that constitutes one piece of 3D scene information into multiple neural networks from which some mutually different nodes are excluded, and transmits the divided neural networks to the client terminal.

 前記3Dシーン情報送信部は、分割後の前記ニューラルネットワークをパケット化するとともに、その送信順をランダムに入れ替えることを特徴とする請求項3に記載のコンテンツサーバ。 The content server according to claim 3, characterized in that the 3D scene information transmission unit packetizes the neural network after division and randomly changes the transmission order.

 前記種類別3Dシーン情報取得部は、前記表示世界における異なる範囲を表す、複数の前記3Dシーン情報を取得することを特徴とする請求項1または2に記載のコンテンツサーバ。 The content server according to claim 1 or 2, characterized in that the type-specific 3D scene information acquisition unit acquires multiple pieces of 3D scene information representing different ranges in the display world.

 前記種類別3Dシーン情報取得部は、前記クライアント端末において表示されている画像において定めた領域に対応する、前記表示世界の範囲を表す前記3Dシーン情報を取得することを特徴とする請求項5に記載のコンテンツサーバ。 The content server according to claim 5, characterized in that the type-specific 3D scene information acquisition unit acquires the 3D scene information that represents the range of the display world that corresponds to an area defined in the image displayed on the client terminal.
 前記種類別3Dシーン情報取得部は、前記表示世界に存在するオブジェクトを対象とする3Dシーン情報を取得することを特徴とする請求項5に記載のコンテンツサーバ。 The content server according to claim 5, characterized in that the type-specific 3D scene information acquisition unit acquires 3D scene information targeting objects present in the display world.

 前記3Dシーン情報送信部は、前記複数種類の3Dシーン情報のデータを、異なる通信プロトコルで前記クライアント端末に送信することを特徴とする請求項1に記載のコンテンツサーバ。 The content server according to claim 1, characterized in that the 3D scene information transmission unit transmits the data of the multiple types of 3D scene information to the client terminal using different communication protocols.

 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得する入力情報取得部と、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得する3Dシーン情報取得部と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、
 を備えたことを特徴とするクライアント端末。
an input information acquisition unit that acquires information on a user operation and information on a display viewpoint for a three-dimensional display world;
a 3D scene information acquisition unit that acquires from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
an image generator configured to render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A client terminal comprising:
 前記3Dシーン情報取得部は、1つの前記3Dシーン情報を構成するニューラルネットワークを分割してなる、互いに異なる一部のノードが除外された複数のニューラルネットワークを取得し、分割前のニューラルネットワークを再構成することを特徴とする請求項9に記載のクライアント端末。 The client terminal according to claim 9, characterized in that the 3D scene information acquisition unit acquires multiple neural networks obtained by dividing a neural network constituting one piece of 3D scene information, with some mutually different nodes removed, and reconstructs the neural network before division.

 前記3Dシーン情報取得部は、空間的な情報の密度が異なる複数の前記3Dシーン情報を、当該情報の密度が低い順に取得し、
 前記画像生成部は、取得された前記3Dシーン情報の前記情報の密度に対応するように、前記フレームの解像度を変化させることを特徴とする請求項9または10に記載のクライアント端末。
The 3D scene information acquisition unit acquires a plurality of pieces of 3D scene information having different spatial information densities in order of decreasing density of the information,
The client terminal according to claim 9, wherein the image generation unit changes the resolution of the frames so as to correspond to the information density of the acquired 3D scene information.
 前記画像生成部は、前記サーバから送信された表示画像のうち、前記表示用視点の変化により描画が必要と判定された領域を、前記3Dシーン情報を用いて描画することを特徴とする請求項9または10に記載のクライアント端末。 The client terminal according to claim 9 or 10, characterized in that the image generation unit uses the 3D scene information to render an area of the display image transmitted from the server that is determined to require rendering due to a change in the display viewpoint.

 ユーザ操作に応じて状況が変化する3次元の表示世界の画像を表示させるクライアント端末と、表示画像の生成に用いるデータを送信するコンテンツサーバと、を含み、
 前記コンテンツサーバは、
 前記表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成する学習用画像生成部と、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得する種類別3Dシーン情報取得部と、
 前記クライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信する3Dシーン情報送信部と、
 を備え、
 前記クライアント端末は、
 前記ユーザ操作の情報と、前記表示世界に対する表示用視点の情報とを取得する入力情報取得部と、
 前記複数種類の3Dシーン情報のデータを、前記コンテンツサーバから取得する3Dシーン情報取得部と、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画する画像生成部と、
 を備えたことを特徴とする画像表示システム。
An image display system comprising a client terminal that displays an image of a three-dimensional display world in which a situation changes in response to a user operation, and a content server that transmits data used to generate the display image,
The content server,
a learning image generating unit that generates images representing how each scene in the display world is viewed from a plurality of viewpoints as learning images;
a type-specific 3D scene information acquisition unit that acquires multiple types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as teacher data;
a 3D scene information transmission unit that transmits the plurality of types of 3D scene information data to the client terminal in the order in which machine learning for each scene is completed;
Equipped with
The client terminal includes:
an input information acquisition unit that acquires information on the user operation and information on a display viewpoint for the display world;
a 3D scene information acquisition unit that acquires the plurality of types of 3D scene information data from the content server;
an image generator configured to render at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
An image display system comprising:
 ユーザ操作に応じて状況が変化する3次元の表示世界の各シーンを、複数の視点から見た様子を表す画像を、学習用画像として生成するステップと、
 前記学習用画像を教師データとする機械学習により、各シーンの3次元情報を表す複数種類の3Dシーン情報を取得するステップと、
 前記3Dシーン情報を用いて表示画像を描画するクライアント端末に、前記複数種類の3Dシーン情報のデータを、シーンごとの機械学習が完了した順に送信するステップと、
 を含むことを特徴とする表示用データ送信方法。
generating, as learning images, images representing how each scene in a three-dimensional displayed world, in which a situation changes in response to a user operation, is viewed from a plurality of viewpoints;
acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
transmitting data of the plurality of types of 3D scene information to a client terminal that renders a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A display data transmission method comprising:
 ユーザ操作の情報と、3次元の表示世界に対する表示用視点の情報とを取得するステップと、
 前記ユーザ操作に応じて状況が変化する前記表示世界の各シーンに対し、機械学習により取得された、3次元情報を表す複数種類の3Dシーン情報のデータを、サーバから取得するステップと、
 最新の前記表示用視点に基づき、直近に取得された前記3Dシーン情報を用いて、表示画像のフレームの少なくとも一部を描画するステップと、
 を含むことを特徴とする表示画像生成方法。
acquiring information on a user operation and information on a display viewpoint for a three-dimensional display world;
acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
Rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A display image generating method comprising:
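The client-side method can likewise be sketched. This is an illustrative sketch only; `ClientRenderer` and its methods are hypothetical names, not from the application. It shows the claimed pairing: each frame is drawn from the most recently received 3D scene information and the latest display viewpoint, with newer scene data superseding older data as it arrives from the server.

```python
class ClientRenderer:
    # Minimal sketch of a client terminal that always pairs the latest
    # display viewpoint with the most recently received 3D scene
    # information when drawing a frame.
    def __init__(self):
        self.latest_viewpoint = None
        self.latest_scene_info = None

    def on_user_input(self, viewpoint):
        # Input information acquisition: record the latest viewpoint.
        self.latest_viewpoint = viewpoint

    def on_scene_info(self, scene_info):
        # 3D scene information acquisition: newer data replaces older.
        self.latest_scene_info = scene_info

    def render_frame(self):
        if self.latest_scene_info is None or self.latest_viewpoint is None:
            return None  # nothing to draw yet
        return f"frame({self.latest_scene_info}@{self.latest_viewpoint})"

r = ClientRenderer()
r.on_scene_info("scene-v1")
r.on_user_input((0.0, 1.0, 2.0))
r.on_scene_info("scene-v2")      # later data supersedes scene-v1
print(r.render_frame())          # → frame(scene-v2@(0.0, 1.0, 2.0))
```

Keeping only the most recent scene data and viewpoint lets the client keep rendering smoothly even while the server is still streaming scene information for other parts of the display world.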
A function of generating images for learning that represent each scene in a three-dimensional display world, the situation of which changes in response to user operations, as seen from multiple viewpoints;
A function of acquiring a plurality of types of 3D scene information representing three-dimensional information of each scene by machine learning using the learning images as training data;
A function of transmitting the data of the plurality of types of 3D scene information to a client terminal that draws a display image using the 3D scene information in the order in which machine learning for each scene is completed;
A computer program characterized by causing a computer to execute the above.
A function for acquiring information on user operations and information on a display viewpoint for a three-dimensional display world;
A function of acquiring from a server a plurality of types of 3D scene information data representing three-dimensional information acquired by machine learning for each scene in the displayed world whose situation changes in response to the user operation;
A function of rendering at least a portion of a frame of a display image using the most recently acquired 3D scene information, based on the latest display viewpoint;
A computer program characterized by causing a computer to execute the above.
PCT/JP2023/043347 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method Pending WO2025120714A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/043347 WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/043347 WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Publications (1)

Publication Number Publication Date
WO2025120714A1 2025-06-12

Family

ID=95979643

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/043347 Pending WO2025120714A1 (en) 2023-12-04 2023-12-04 Content server, client terminal, image display system, display data transmission method, and display image generation method

Country Status (1)

Country Link
WO (1) WO2025120714A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160300385A1 (en) * 2014-03-14 2016-10-13 Matterport, Inc. Processing and/or transmitting 3d data
JP2020005201A (en) * 2018-06-29 2020-01-09 日本放送協会 Transmitting device and receiving device
US20210035352A1 (en) * 2018-03-20 2021-02-04 Pcms Holdings, Inc. System and method for dynamically adjusting level of details of point clouds
CN114004941A (en) * 2022-01-04 2022-02-01 苏州浪潮智能科技有限公司 Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
JP2023543538A (en) * 2020-07-31 2023-10-17 グーグル エルエルシー Robust viewpoint synthesis for unconstrained image data


Similar Documents

Publication Publication Date Title
EP3760287B1 (en) Method and device for generating video frames
JP7531568B2 (en) Eye tracking with prediction and latest updates to the GPU for fast foveated rendering in HMD environments
JP7391939B2 (en) Prediction and throttle adjustment based on application rendering performance
JP7164630B2 (en) Dynamic Graphics Rendering Based on Predicted Saccade Landing Points
US10937220B2 (en) Animation streaming for media interaction
US10089790B2 (en) Predictive virtual reality display system with post rendering correction
US9832451B2 (en) Methods for reduced-bandwidth wireless 3D video transmission
JP6298432B2 (en) Image generation apparatus, image generation method, and image generation program
US8909506B2 (en) Program, information storage medium, information processing system, and information processing method for controlling a movement of an object placed in a virtual space
CN111522433B (en) Method and system for determining current gaze direction
CN110832442A (en) Optimized shadows and adaptive mesh skinning in foveated rendering systems
US11107183B2 (en) Adaptive mesh skinning in a foveated rendering system
KR100623173B1 (en) Game character animation implementation system, implementation method and production method
WO2025120714A1 (en) Content server, client terminal, image display system, display data transmission method, and display image generation method
JP2003305275A (en) Game program
US20240282034A1 (en) Apparatus, systems and methods for animation data
KR102192153B1 (en) Method and program for providing virtual reality image
US12422978B2 (en) User interaction management to reduce lag in user-interactive applications
KR102179810B1 (en) Method and program for playing virtual reality image
WO2025094514A1 (en) Content processing device and content processing method
WO2025094266A1 (en) Content server, client terminal, image display system, draw data transmission method, and display image generation method
JP2004005182A (en) Visualization method
WO2024089725A1 (en) Image processing device and image processing method
WO2025094265A1 (en) Content server and content processing method
WO2025094264A1 (en) Image processing device and image processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23960729

Country of ref document: EP

Kind code of ref document: A1