WO2026018019A1 - Processing a three-dimensional representation of a scene - Google Patents
Processing a three-dimensional representation of a scene
- Publication number
- WO2026018019A1 (application PCT/GB2025/051602)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- dimensional
- points
- aovs
- dimensional representation
- point
- Prior art date
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformations in the plane of the image
- G06T3/06—Topological mapping of higher dimensional structures onto lower dimensional surfaces
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/20—Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G06T15/20—Perspective computation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/50—Lighting effects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T19/00—Manipulating 3D models or images for computer graphics
- G06T19/006—Mixed reality
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2200/00—Indexing scheme for image data processing or generation, in general
- G06T2200/04—Indexing scheme for image data processing or generation, in general involving 3D image data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/56—Particle system, point based geometry or rendering
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2219/00—Indexing scheme for manipulating 3D models or images for computer graphics
- G06T2219/20—Indexing scheme for editing of 3D models
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2219/00—Indexing scheme for manipulating 3D models or images for computer graphics
- G06T2219/20—Indexing scheme for editing of 3D models
- G06T2219/2012—Colour editing, changing, or manipulating; Use of colour codes
Definitions
- the present disclosure relates to methods, systems, and apparatuses for processing a three-dimensional representation of a scene.
- Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user.
- substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
- a method of processing a three-dimensional representation of a scene comprising: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
- the method comprises identifying a modification during a compositing stage.
- the method comprises: for one or more of the points of the three-dimensional representation, generating a first set of arbitrary output values (AOVs); and rendering the one or more two-dimensional objects based on the first set of AOVs.
- the first set of AOVs defines location information of the two-dimensional objects.
- the first set of AOVs is defined based on location information of the one or more points of the three-dimensional representation.
- the first set of AOVs comprises one or more of: an AOV defining a normal value of a corresponding point; an AOV defining a capture device associated with a corresponding point; and an AOV defining a distance of a corresponding point from a capture device.
- the first set of AOVs is non-modifiable.
- identifying a modification of the objects comprises identifying a modification of a colour of an object and/or identifying a modification of an AOV associated with an object and/or a point.
- the modified AOV is an editable AOV.
- the one or more points are associated with: a first set of AOVs that defines a location of the point, preferably wherein the first set of AOVs is non-modifiable; and a second set of AOVs that defines a colour of the point.
- the second set of AOVs is modifiable.
- the method comprises: determining that a modification has been made to one or more of the objects; and re-rendering the two-dimensional objects based on the modification.
- the method comprises re-rendering the two-dimensional objects based on the first set of AOVs for the one or more points of the three-dimensional representation.
- the method comprises: associating one or more arbitrary output values (AOVs) with the two-dimensional objects; identifying a modification of one or more of the AOVs; and identifying the two-dimensional objects that are associated with the modified AOVs.
- the method comprises modifying the identified points.
- the method comprises modifying the identified points based on the modifications to the AOVs.
- the method comprises: storing (e.g. at the time of rendering the two-dimensional object) a correspondence between the two-dimensional objects and corresponding points of the three-dimensional representation; wherein identifying the points of the three-dimensional representation comprises identifying the points based on the correspondences.
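By way of illustration only, a minimal sketch of such a render-time correspondence is shown below; the dictionary, function and identifier names are hypothetical and not taken from the application.

```python
# Hypothetical sketch: record which 3D points produced each rendered 2D object,
# so that a later 2D modification can be mapped back to the originating points.

correspondence = {}  # 2D object id -> list of 3D point ids


def render_object(obj_id, point_ids):
    """Render a 2D object from the given points and remember the link."""
    correspondence[obj_id] = list(point_ids)
    # ... actual rasterisation of the points would happen here ...


def points_for_modification(modified_obj_ids):
    """Identify the 3D points associated with the modified 2D objects."""
    points = set()
    for obj_id in modified_obj_ids:
        points.update(correspondence.get(obj_id, []))
    return points


render_object("wall_segment_0", [101, 102, 103])
print(points_for_modification({"wall_segment_0"}))  # {101, 102, 103}
```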
- the method comprises: determining one or more datafields associated with the points; and generating one or more AOVs based on the values of the datafields.
- the datafields define a location and/or one or more attributes of the points.
- the method comprises: generating a two-dimensional array based on the points, wherein each entry of the two-dimensional array is associated with a point of the three-dimensional representation; and associating each entry of the array with one or more AOVs, the AOVs representing attributes of the point associated with said entry.
- the AOVs indicate a location of each point in the three-dimensional representation.
- the AOVs indicate an attribute value of each point, preferably wherein the AOVs indicate one or more of: a normal; a transparency; a colour; a left eye attribute value; and a right eye attribute value.
- the method comprises determining a transformation that converts the two-dimensional array into a two-dimensional image that represents the scene.
- the transformation is determined based on the values of the AOVs associated with each entry.
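The following is an illustrative sketch of such a two-dimensional array of AOVs and a simple transformation into an image; the array shapes, channel names and the distance-based attenuation are assumptions made purely for the example.

```python
import numpy as np

# Hypothetical sketch: a 2D array with one entry per point, each entry carrying
# AOV channels (here a modifiable colour AOV plus a non-modifiable distance AOV
# used as location information).
n_elev, n_azim = 4, 8
colour_aov = np.zeros((n_elev, n_azim, 3), dtype=np.float32)    # modifiable
distance_aov = np.ones((n_elev, n_azim), dtype=np.float32)      # location info


def to_image(colour, distance, max_distance=10.0):
    """Example transformation from the point array to a displayable 2D image,
    attenuating colour using the distance stored in the location AOV."""
    weight = np.clip(1.0 - distance / max_distance, 0.0, 1.0)
    return colour * weight[..., None]


image = to_image(colour_aov, distance_aov)
print(image.shape)  # (4, 8, 3)
```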
- rendering the objects comprises: determining one or more scene files, wherein at least one scene file comprises a three-dimensional representation of the scene; and rendering one or more two-dimensional objects based on the scene files.
- the method comprises compositing the rendered two-dimensional objects to form a two-dimensional immersive image.
- the rendered two-dimensional objects comprise one or more of: one or more two-dimensional objects rendered based on the three-dimensional representation; and a two-dimensional background image.
- the compositing comprises superimposing a two-dimensional object rendered based on the three-dimensional representation onto a two-dimensional background image.
- the method comprises rendering a plurality of layers associated with the three-dimensional representation, wherein each layer comprises a two-dimensional image and wherein each layer is associated with one or more points of the three-dimensional representation that are similar distances from a viewing zone of the three-dimensional representation.
- each layer is associated with a respective set of AOVs.
- the method comprises: identifying a modification to one or more AOVs; identifying one or more points of the three-dimensional representation that are associated with the AOVs; and modifying the points of the three-dimensional representation based on the identified modification to the one or more AOVs.
- the modification relates to one or more of: an attribute value; a colour; a location; a normal; and a transparency.
- identifying the modification to one or more AOVs comprises: identifying a modification to a value of a point of the three-dimensional representation; and updating a value of an AOV relating to this value based on said modification.
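As a purely illustrative sketch (the point records, AOV name and mapping structure below are assumed, not specified by the application), an AOV modification identified in the two-dimensional domain might be written back to the associated points roughly as follows:

```python
# Hypothetical sketch: propagate an edit made to an AOV (e.g. a colour change
# made during compositing) back to the corresponding points of the
# three-dimensional representation.

points = {7: {"colour": (0.2, 0.2, 0.2)}, 8: {"colour": (0.9, 0.1, 0.1)}}
aov_to_points = {"layer3.colour": [7, 8]}   # which points each AOV relates to


def apply_aov_modification(aov_name, new_value):
    """Identify the points behind the modified AOV and update them."""
    for point_id in aov_to_points.get(aov_name, []):
        points[point_id]["colour"] = new_value


apply_aov_modification("layer3.colour", (0.0, 0.5, 1.0))
print(points[7]["colour"])  # (0.0, 0.5, 1.0)
```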
- the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene, preferably wherein the user is able to move within the viewing zone with six degrees of freedom (6DoF).
- the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene.
- the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m3), less than one cubic metre (1 m3), less than one-tenth of a cubic metre (0.1 m3) and/or less than one-hundredth of a cubic metre (0.01 m3).
- the three-dimensional representation comprises a point cloud.
- the method comprises storing the three-dimensional representation and/or outputting the three- dimensional representation.
- the method comprises outputting the three-dimensional representation to a further computer device.
- the method comprises generating an image and/or a video based on the three-dimensional representation.
- the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation.
- the method comprises forming a two-dimensional representation for each eye of a viewer.
- the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
- the point is associated with an attribute for a right eye and an attribute for a left eye.
- the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
- the method comprises forming a bitstream that includes the point.
- a system for carrying out the aforesaid method comprising one or more of: a processor; a communication interface; and a display.
- an apparatus for processing a three-dimensional representation of a scene comprising: means for (e.g. a processor for) rendering one or more two-dimensional objects based on points of the three-dimensional representation; means for (e.g. a processor for) identifying a modification of one or more of the objects; means for (e.g. a processor for) identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and means for (e.g. a processor for) outputting the modifications and the identified points of the three-dimensional representation.
- a bitstream comprising one or more points modified using the aforesaid method.
- an apparatus (e.g. an encoder) for forming and/or encoding the aforesaid bitstream.
- an apparatus (e.g. a decoder) for receiving and/or decoding the aforesaid bitstream.
- a system for processing a three-dimensional representation of a scene comprising: a viewer module for: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; and re-rendering the two-dimensional objects based on the modification; and an editing module for: identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
- the system comprises a reader module for: determining, for one or more points of the three-dimensional representation, a first set of arbitrary output values (AOVs), the first set of AOVs defining location information for the one or more points, wherein the viewer module is arranged to render the two-dimensional objects based on the first set of AOVs for the points.
- Any apparatus feature as described herein may also be provided as a method feature, and vice versa.
- means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
- the disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
- the disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
- the disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
- the disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
- the disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
- Figure 1 shows a system for generating a sequence of images.
- Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
- Figure 3 shows a method of determining a three-dimensional representation of a scene.
- Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
- Figure 5 shows a scene comprising a viewing zone.
- Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
- Figure 7 shows different versions of a point that may be captured by different capture devices.
- Figures 8a and 8b show grids formed by the different capture devices.
- Figure 9 shows an arrangement for indicating an angle of a point from a capture device used to capture the point.
- Figure 10 shows a method of rendering a final image based on an intermediate image.
- Figure 11 shows a method of generating a composite image.
- Figures 12 and 13 show a method of modifying a three-dimensional representation based on one or more arbitrary output values (AOVs).
- Figure 14 shows modules of an image processing system.
- Figure 15 shows a bitstream.
- Referring to Figure 1, there is shown a system for generating a sequence of images.
- This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
- the system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
- these components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
- the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
- each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
- Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below); a communication interface 22 for facilitating communication between computer devices (e.g. an ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface); a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory); and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device.
- the computer device 20 may comprise further (or fewer) components.
- the computer device may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
- the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images.
- the image data may comprise one or more digital objects and the image data may be generated or encoded in any format.
- the image data may comprise point cloud data, where each point has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
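By way of a non-authoritative illustration, one possible in-memory shape for such a point record is sketched below; the field names and default values are assumptions for the example only.

```python
from dataclasses import dataclass
from typing import Tuple


# Illustrative only: one possible shape for a point record with a 3D position
# and the kinds of attributes mentioned above.
@dataclass
class Point:
    position: Tuple[float, float, float]
    colour: Tuple[float, float, float] = (1.0, 1.0, 1.0)   # continuous range
    transparency: float = 0.0                               # continuous range
    size_index: int = 0                                     # discrete set
    normal: Tuple[float, float, float] = (0.0, 0.0, 1.0)


cloud = [Point((0.0, 1.0, 2.5), colour=(0.8, 0.2, 0.2))]
print(cloud[0].transparency)  # 0.0
```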
- the image data enables the later rendering of images.
- This image data may enable a direct rendering (e.g. the image data may directly represent an image).
- the image data may require further processing in order to enable rendering.
- the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
- the image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data.
- the depth map data may be provided as a depth map layer, separate from an image layer.
- the image layer may instead be described as a texture layer.
- the depth map layer may instead be described as a geometry layer.
- the image data may include a predicted display window location.
- the predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17.
- the predicted display window location may be based on a viewing position (such as a virtual position and/or orientation of the user in a 3D environment) of the user, where this viewing position may be obtained from the display device.
- the predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or center of a predicted display window, and may be defined using a size of the predicted display window.
- the predicted display window location may be encoded as part of metadata included with the frame.
- the image data for each image may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers.
- the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data.
- An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
- the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact.
- the interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour.
- the behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element.
- the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target.
- This interactivity data may be provided as part of, or separately to, the image data.
- the image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user.
- the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may be purely virtual, for example being controlled using a handheld controller.
- the image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
- the generated image may be independent of user position and viewing direction.
- This type of image generation typically requires significant computer resources, such as a powerful GPU, and may be implemented in a cloud service, such as a Cloud Rendering Service (CRN), or on a local but powerful computer.
- rendering refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
- the image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
- the encoder 12 is configured to encode frames to be transmitted to the display device 17.
- the encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
- the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
- the encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner.
- the encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames.
- the encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC) enabled encoder.
- the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression.
- depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted.
- providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time.
- this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
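The following is a deliberately simplified sketch of the general base-plus-enhancement idea for a depth map; it does not follow the actual LCEVC or VC-6 bitstream formats, and the naive 2x down/upsampling is an assumption made only to keep the example short.

```python
import numpy as np


# Simplified sketch of the general base/enhancement idea for a depth map (not
# the LCEVC or VC-6 bitstream format): the base layer is a downsampled depth
# map and the enhancement layer is the residual needed to recover full detail.
def encode_layers(depth):
    base = depth[::2, ::2]                                   # coarse base layer
    up = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)
    enhancement = depth - up[:depth.shape[0], :depth.shape[1]]
    return base, enhancement


def decode_layers(base, enhancement=None):
    up = np.repeat(np.repeat(base, 2, axis=0), 2, axis=1)
    if enhancement is None:       # enhancement layer dropped: lower fidelity
        return up
    return up[:enhancement.shape[0], :enhancement.shape[1]] + enhancement


depth = np.random.randint(0, 2 ** 12, (8, 8)).astype(np.int32)  # e.g. 12-bit
base, enh = encode_layers(depth)
assert np.array_equal(decode_layers(base, enh), depth)
```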
- Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers.
- in conventional arrangements, the best that the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is signal to the controller/renderer/encoder that it does not have enough resources.
- the controller will then send future images at a lower quality.
- in the meantime, the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
- with layered encoding, this situation is improved upon because, when/if the end device determines (for example) that it does not have the processing capabilities to handle the highest level of quality, it can drop and/or choose not to process certain layers.
- the end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
- depth map data may be embedded in image data.
- the base depth map layer may be a base image layer with embedded depth map data
- the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
- the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism).
- the encoded depth map layers can be dropped by a transmitter or encoder when available communication resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
- where some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
- where some images comprise an interactivity data base layer or an interactivity enhancement layer, these can be processed or dropped flexibly.
- certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
- the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference.
- a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6.
- LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal.
- the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
- the transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
- the transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11.
- the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data.
- the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
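A minimal, hypothetical sketch of such a layer-selection decision is shown below; the layer names, sizes and the priority-ordered list are assumptions for illustration.

```python
# Hypothetical sketch: a transmitter choosing which layers to keep for the
# available bandwidth, dropping optional layers (e.g. enhancement or depth map
# layers) before mandatory ones.
def select_layers(layers, available_kbps):
    """layers: list of (name, size_kbps, required) tuples in priority order."""
    kept, used = [], 0
    for name, size_kbps, required in layers:
        if required or used + size_kbps <= available_kbps:
            kept.append(name)
            used += size_kbps
    return kept


frame_layers = [
    ("base_image", 800, True),
    ("base_depth", 300, False),
    ("enhancement_image", 600, False),
    ("enhancement_depth", 400, False),
]
print(select_layers(frame_layers, available_kbps=1200))
# ['base_image', 'base_depth'] -- the enhancement layers are dropped
```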
- the network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network.
- the network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30 MBps, which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
- the receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
- the decoder 16 is configured to receive and decode an encoded frame.
- the decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
- the display device 17 may for example be a television screen or a VR headset.
- the timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image.
- the display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
- the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer.
- the warpable image may then be rendered before a most up to date viewing direction of the user is known.
- the warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction of the user.
- a single device may provide a plurality of the described components.
- a first rendering node may comprise the image generator 11, encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
- multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
- the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
- the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
- multiple rendering nodes may be chained.
- successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15.
- each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
- a chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates.
- a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users.
- Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user.
- the more responsive a rendering feature needs to be, the lower the latency needed between the rendering node which generates the feature and the user display; in a chain of rendering nodes, the node which generates each rendering feature can therefore be chosen based on a required maximum latency of that feature.
- a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape.
- the frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate.
- a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate.
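As an illustration of the frame-rate matching described above, the sketch below repeats frames generated at the lower frame rate so they can be combined with frames generated at the higher rate; the frame rates, labels and combination structure are assumptions for the example.

```python
# Hypothetical sketch of frame-rate matching: features rendered at a low frame
# rate are repeated so that they can be combined with features rendered at a
# higher frame rate.
def match_frame_rates(slow_frames, slow_fps, fast_frames, fast_fps):
    repeat = fast_fps // slow_fps
    combined = []
    for i, fast in enumerate(fast_frames):
        slow = slow_frames[min(i // repeat, len(slow_frames) - 1)]
        combined.append({"background": slow, "foreground": fast})
    return combined


background = ["bg0", "bg1"]                  # e.g. ray-traced lighting, 30 fps
foreground = [f"fg{i}" for i in range(8)]    # e.g. user-specific view, 120 fps
print(match_frame_rates(background, 30, foreground, 120)[3])
# {'background': 'bg0', 'foreground': 'fg3'}
```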
- One or more of these steps may be performed in combination with the other described embodiments.
- the viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain.
- Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
- the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices. For example, in the context of a VR or AR experience, each user or display device may view a different 3D environment, or may view different parts of a same 3D environment.
- each node may serve multiple users or just one user.
- a starting rendering node may serve a large group of users.
- the group of users may be viewing nearby parts of a same 3D environment.
- the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
- the starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users.
- the first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
- the middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user.
- the end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
- each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15.
- the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
- each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats.
- the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6.
- a second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
- the chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
- a content rendering network comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event separately for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric event once and performing the rendering equivalent of multicasting the volumetric event to multiple users.
- each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node.
- the first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes.
- the second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
- the receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices.
- some VR headset systems comprise a base unit and a headset unit which communicate with each other.
- the receiver 15 and decoder 16 may be incorporated into such a base unit.
- a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
- the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14.
- the layer drop indication may be received by each rendering node.
- a rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer.
- a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices).
- rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
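A small, hypothetical sketch of recording layer drop indications per end device, and only ceasing generation once every served device has asked for the drop, might look as follows (class and device names are invented for the example):

```python
# Hypothetical sketch: a rendering node serving several end devices only stops
# generating a layer once every served device has asked for it to be dropped.
class LayerDropTracker:
    def __init__(self, served_devices):
        self.served = set(served_devices)
        self.drop_requests = {}            # layer name -> set of device ids

    def record_drop(self, device_id, layer):
        self.drop_requests.setdefault(layer, set()).add(device_id)

    def should_generate(self, layer):
        return self.drop_requests.get(layer, set()) != self.served


tracker = LayerDropTracker({"headset_a", "headset_b"})
tracker.record_drop("headset_a", "enhancement_depth")
print(tracker.should_generate("enhancement_depth"))  # True: headset_b needs it
tracker.record_drop("headset_b", "enhancement_depth")
print(tracker.should_generate("enhancement_depth"))  # False: all have dropped it
```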
- the encoders or decoders are part of a tier-based hierarchical coding scheme or format.
- Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes.
- one or more enhancement layers is communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension.
- hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate.
- Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein.
- A further example is described in WO2018/046940, which is incorporated by reference herein.
- a set of residuals are encoded relative to the residuals stored in a temporal buffer.
- Low-Complexity Enhancement Video Coding (LCEVC) is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021, which is incorporated by reference herein.
- the system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user.
- the scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment.
- the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real-world and/or by providing an input to a user interface) in order to inspect various parts of the room.
- the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
- the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17.
- the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value. Therefore, the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
- the method comprises determining the attribute using a capture device, such as a camera or a scanner.
- the scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
- where reference is made to determining a point, it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point).
- these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
- the scene comprises a simulated scene that exists only on a computer.
- a scene may, for example, be generated using software such as the Maya software produced by Autodesk®.
- the attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
- a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
- the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle.
- Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a colour of the point and/or a transparency value of the point.
- the point has one or more colour values associated with each of a left eye and a right eye of a viewer.
- Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies.
- capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
- determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The colour, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
- in a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths and, in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented, the method returns to the second step 32, and another point is captured.
- the azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
- the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
- in a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations and, in a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented, the method returns to the second step 32, and another point is captured.
- the elevation angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°.
- the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
- in a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
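A simplified sketch of this scan loop is given below; the step sizes, ranges and the placeholder capture function are assumptions, and a real implementation would perform the ray casting and attribute lookup described above.

```python
# Simplified sketch of the capture loop of Figure 3: step through azimuth and
# elevation angles and capture one point per direction. capture_point stands in
# for the ray casting / attribute lookup described above.
def scan(capture_point, azimuth_step=0.1, elevation_step=0.1,
         azimuth_range=360.0, elevation_range=360.0):
    points = []
    elevation = 0.0
    while elevation < elevation_range:
        azimuth = 0.0
        while azimuth < azimuth_range:
            points.append(capture_point(azimuth, elevation))
            azimuth += azimuth_step
        elevation += elevation_step
    return points


# Coarse steps are used here only to keep the example small:
points = scan(lambda az, el: {"azimuth": az, "elevation": el, "colour": (1, 1, 1)},
              azimuth_step=90.0, elevation_step=90.0)
print(len(points))  # 16 points (4 azimuth angles x 4 elevation angles)
```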
- This method enables a capture device to capture points at a range of elevation and azimuth angles.
- This point data is typically stored in a matrix.
- the point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer.
- a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
- the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary; numerous capture patterns are possible.
- the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
- the ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle).
- the size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
- the size of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height.
- the width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
- the points comprise a size value, which is stored as a part of the point data.
- the points may be stored with a width value and/or a height value.
- the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively.
- the size may be then specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment).
- the size value is stored as an index, which index relates to a known list of sizes (e.g. the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels; this may be specified by using 3 bits and a list that relates each combination of bits to a size).
- the size may be stored based on an underscan value.
- in this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan.
- the size of the points may be defined so as to indicate this underscan value.
- the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2] and a height in the range [0,2]) and an underscan factor (e.g. an underscan factor in the range [0,3]).
- the width and the height are dependent on the underscan factor. For example, when the underscan factor exceeds a threshold value, the possible height and width values may be limited. In a specific example, when the underscan factor is 3, the width and the height may be limited to the range [0,1].
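The bit layout below is a hypothetical sketch of packing such a size value (width, height and underscan factor), including the example constraint that width and height are limited when the underscan factor is 3; the exact bit positions are assumptions, not taken from the application.

```python
# Hypothetical sketch of packing a point's size value: a width in [0,2], a
# height in [0,2] and an underscan factor in [0,3], with width/height limited
# to [0,1] when the underscan factor is 3 (as in the specific example above).
def pack_size(width, height, underscan):
    if underscan == 3 and (width > 1 or height > 1):
        raise ValueError("width/height limited to [0,1] when underscan is 3")
    return (underscan << 4) | (height << 2) | width    # 2 + 2 + 2 bits


def unpack_size(value):
    return value & 0b11, (value >> 2) & 0b11, (value >> 4) & 0b11


packed = pack_size(width=2, height=1, underscan=1)
print(unpack_size(packed))  # (2, 1, 1)
```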
- a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined.
- where the azimuth angle increment is 0.1°, then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0°, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point.
- a maximum attribute value of the sub-points may be used as the value for the point
- an average attribute value of the sub-points may be used as the value for the point
- a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
- the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
- this capture device may obtain attributes associated with each of the sub-points SP1, SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
- these points may have different distances from the location of the capture device.
- the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
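For illustration, a distance-weighted combination of sub-point attributes might look like the sketch below (the inverse-distance weighting is only one of the schemes mentioned above; a maximum or plain average could equally be used):

```python
# Hypothetical sketch: combine sub-point attributes into a single point value,
# weighting nearer sub-points more heavily.
def combine_subpoints(subpoints):
    """subpoints: list of (distance, colour) tuples; returns (distance, colour)."""
    weights = [1.0 / max(d, 1e-6) for d, _ in subpoints]
    total = sum(weights)
    distance = sum(w * d for w, (d, _) in zip(weights, subpoints)) / total
    colour = tuple(sum(w * c[i] for w, (_, c) in zip(weights, subpoints)) / total
                   for i in range(3))
    return distance, colour


print(combine_subpoints([(1.0, (1.0, 0.0, 0.0)), (2.0, (0.0, 0.0, 1.0))]))
# (1.33..., (0.66..., 0.0, 0.33...)) -- the nearer, red sub-point dominates
```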
- the possibility of sub-points with substantially different distances raises a potential problem.
- as described above, the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
- the attribute value of the point may be substantially different to the attribute value of other points in the scene.
- the point may appear as a grey point hanging in space between these objects.
- the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point.
- the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1, SP2, and SP3 may be grouped into a first point and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to a capture device.
- the first point (made up of sub-points SP1, SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a nonzero transparency value so that the second point can be seen through the first point.
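A minimal sketch of such a spatial-coherence grouping is shown below; the threshold value and the simple one-dimensional grouping are assumptions used only to illustrate splitting sub-points across two surfaces.

```python
# Hypothetical sketch of the spatial-coherence check: group sub-points whose
# distances lie within a threshold of each other, so that no averaged point is
# left "floating" between two separate surfaces.
def group_by_distance(distances, threshold=0.5):
    ordered = sorted(distances)
    groups, current = [], [ordered[0]]
    for d in ordered[1:]:
        if d - current[-1] <= threshold:
            current.append(d)
        else:
            groups.append(current)
            current = [d]
    groups.append(current)
    return groups


# Sub-points SP1-SP3 on a near surface, SP4-SP5 on a far surface:
print(group_by_distance([1.0, 1.1, 1.05, 4.0, 4.2]))
# [[1.0, 1.05, 1.1], [4.0, 4.2]] -- two points; the nearer one can be made
# partially transparent so the farther one remains visible
```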
- By capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer's head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
- the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire).
- the external computer device may comprise powerful graphics processing units (GPUs) and/or central processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional representation and the sensor data.
- the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly.
- This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment.
- operations that require large amounts of computing power, such as the rendering of two-dimensional images based on the three-dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
- a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is “warped” in order to provide an image for viewing at the display device.
- the warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer.
- the three-dimensional representation of the scene may be captured using a plurality of capture devices placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations).
- a three-dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
- the three-dimensional representation may be associated with a viewing zone, or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
- Figure 5 illustrates such a viewing zone 1 and illustrates how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene.
- Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
- the viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone.
- This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene.
- the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene) - and so the use of the viewing zone will not provide any reduction in file size.
- the viewing zone 1 enables movement of a viewer around (a portion of) the scene.
- the base representation may enable a user to walk around the room so as to view the room from different angles.
- the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement, where this aids in the provision of an immersive experience.
- the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
- the volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene).
- the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
- the viewing zone 1 may have a maximum size, e.g. as described above.
- the viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m³); at least one-hundredth of a cubic metre (0.01 m³); and/or at least one cubic metre (1 m³).
- the ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone.
- the size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world.
- the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life).
- the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
- a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. And this may cause the user to move about a volume that is more than, or less than, one metre in the scene.
- a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras).
- Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles.
- Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone.
- FIG. 7 shows a first point P1 being captured by each of a first capture device C1 , a sixth capture device C6, and a seventh capture device C7.
- Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
- this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
- capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
- a quality value of a version of the point is determined based on the spread of sub-points associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points).
- the version of the point that is stored may depend on the respective quality values of possible versions of the points.
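- A minimal sketch of such a quality value is given below (Python; the function names and the use of mean distance from the centroid as the spread measure are assumptions for illustration): the version of a point whose sub-points are least spread out is treated as the highest-quality version.

```python
# Illustrative sketch: a quality value derived from the spread of the
# sub-points associated with a version of a point.
import math

def spread_quality(sub_point_positions):
    """Smaller spread (tighter cluster of sub-points) -> higher quality.
    Spread is taken here as the mean distance from the centroid."""
    n = len(sub_point_positions)
    cx = sum(p[0] for p in sub_point_positions) / n
    cy = sum(p[1] for p in sub_point_positions) / n
    cz = sum(p[2] for p in sub_point_positions) / n
    spread = sum(math.dist(p, (cx, cy, cz)) for p in sub_point_positions) / n
    return 1.0 / (1.0 + spread)

def best_version(versions):
    """Pick the version of a point whose sub-points are least spread out.
    Each version is assumed to carry a 'sub_points' list of (x, y, z)."""
    return max(versions, key=lambda v: spread_quality(v["sub_points"]))
```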
- two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity.
- this further point may be considered to be a ‘version’ of one of the first point and the second point.
- the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points will be based on different grids.
- the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1.
- the capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
- Figure 8a shows an exaggerated difference between grids for the sake of illustration.
- Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
- the present disclosure considers an efficient method of storing the locations of the points (e.g. at an encoder) and of determining the locations of the points (e.g. at a decoder).
- the points of the three-dimensional representation are determined using a set of capture devices placed at locations about the viewing zone, where these capture devices are arranged to capture points at a series of azimuth angles and elevation angles.
- each of the capture devices is arranged to use the same capture process (e.g. the same series of azimuth angles and elevation angles), though it will be appreciated that different series of capture angles are possible. For example, there may be a plurality of possible series of capture angles, where different capture devices use different capture angles.
- the present disclosure considers a method in which points are stored based on a capture device identifier and an indication of a distance of the point from the capture device associated with this capture device identifier.
- the point is also associated with an angular indicator, which indicates an azimuth angle and/or an elevation angle of the point relative to the identified capture device.
- the storage of the distance and the angle may take many forms.
- the distance and the angle of each point may be converted into a universal coordinate system, where each capture device has a different location in this universal coordinate system.
- each point may be stored with reference to a centre of this universal coordinate system, which centre may be co-located with a central capture device.
- the coordinates of the point in this universal coordinate system can be determined trivially - and the location of the point may then be stored either relative to the capture device or as a coordinate in the universal coordinate system.
- the capture device identifier may comprise a location of a capture device (e.g. a location in a co-ordinate system of the three-dimensional representation). Equally, the capture device identifier may comprise an index of a capture device. Similarly, the indication of the azimuth angle and the elevation angle for a point may comprise an angle with reference to a zero-angle of a co-ordinate system of the three-dimensional representation. Equally, the azimuth angle and/or the elevation angle may be indicated using an angle index.
- the three-dimensional representation is associated with configuration information, which configuration information comprises one or more of: a set of capture device indexes; locations associated with the capture devices and/or the capture device indexes; a spacing of capture devices (e.g. so that locations of the capture devices can be determined from a location of a first capture device and the spacing); angles associated with a capture process for the capture devices; an azimuth angle increment and/or an elevation angle increment associated with the capture process; and a set of angle indexes (e.g. to match an angle index to an angle).
- this configuration information enables a computer device to determine a location of each capture device from an index of that capture device and/or to determine a capture angle from a known capture process. Therefore, given two numbers: a capture device index and an angle index (that is associated with a combination of a specific azimuth angle and a specific elevation angle), a location of a capture device and a direction of a point from this capture device can be determined. By also signalling a distance of the point from the signalled capture device, a precise location of the point in the three-dimensional space can be signalled efficiently.
- the point is associated with each of: a camera index, a distance, a first angular index (e.g. an azimuth index), and a second angular index (e.g. an elevation index).
- This method of indicating a location of a point enables point locations to be identified using a much smaller number of bits than if each point location is identified using x, y, z coordinates.
- a method of determining a location of a point may be carried out by a computer device (e.g. the image generator 11 and/or the decoder 15) and may comprise:
- the computer device identifies an indicator of a capture device used to capture the point. Typically, this comprises identifying a portion of a string of bits associated with a capture device index.
- the computer device identifies an indicator of an angle of the point from the capture device.
- this comprises identifying an angle index, e.g. an azimuth index and/or an elevation index and/or a combined azimuth/elevation index, which index(es) identifies a step of the capture process during which the point was captured.
- the computer device determines the location of the capture device and the angle of the point from the capture device.
- the capture device identifier is typically a capture device index, which is related to a capture device location based on configuration information that has been sent before, or along with, the point data.
- the configuration information may specify:
- Step between capture devices is (0,0,1) along the grid, then across the grid, then up the grid.
- the grid is (10,10,10).
- a capture device with an index of 1 can be determined to be located at (0,0,0); a capture device with an index of 5 can be determined to be located at (0,0,4); a capture device with an index of 11 can be determined to be located at (0,1,0), and so on.
- the configuration information may specify a list of camera indexes and locations associated with these indexes, where this enables the use of a wide range of setups of capture devices.
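- For the grid-style configuration given above, a capture device location can be recovered from its index along the lines of the following sketch (Python; the function name, the 1-based indexing, and the ordering of axes are assumptions chosen to match the worked example).

```python
# Sketch of recovering a capture device location from its index using the
# grid-style configuration information described above.
def device_location(index, origin=(0, 0, 0), grid=(10, 10, 10)):
    """Indices run along the grid first, then across it, then up it."""
    k = index - 1                       # assumed 1-based indexing
    z = k % grid[2]
    y = (k // grid[2]) % grid[1]
    x = k // (grid[2] * grid[1])
    return (origin[0] + x, origin[1] + y, origin[2] + z)

# device_location(1)  -> (0, 0, 0)
# device_location(5)  -> (0, 0, 4)
# device_location(11) -> (0, 1, 0)
```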
- the three-dimensional representation is associated with a frame of video.
- the configuration information may be constant over the frames of the video so that the configuration information needs to be signalled only once for an entire video. Therefore, the configuration information may be transmitted alongside a three-dimensional representation of a first frame of the video, with this same information being used for any subsequent frames (e.g. until updated configuration information is sent).
- the angle identifier may similarly be related to an angle by a location and an increment that are signalled in a configuration file.
- the configuration information may specify:
- An azimuth increment and an elevation increment are each 1 °.
- a capture angle with an index of 1 can be determined to be at an azimuth angle of 0° and an elevation angle of 0°; a capture angle with an index of 10 can be determined to be at an azimuth angle of 9° and an elevation angle of 0°; a capture angle with an index of 361 can be determined to be at an azimuth angle of 0° and an elevation angle of 1°; and a capture angle with an index of 370 can be determined to be at an azimuth angle of 9° and an elevation angle of 1°; etc.
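- The same style of mapping can be sketched for angle indexes (Python; the 1-based indexing and the assumption of 360 azimuth steps per elevation row are illustrative choices consistent with the examples above).

```python
# Sketch of decoding an angle index into an azimuth/elevation pair under
# the configuration above (1 degree increments, 360 azimuth steps per row).
def decode_angle_index(index,
                       azimuth_increment=1.0,
                       elevation_increment=1.0,
                       azimuth_steps=360):
    k = index - 1                       # assumed 1-based indexing
    azimuth = (k % azimuth_steps) * azimuth_increment
    elevation = (k // azimuth_steps) * elevation_increment
    return azimuth, elevation

# decode_angle_index(1)   -> (0.0, 0.0)
# decode_angle_index(10)  -> (9.0, 0.0)
# decode_angle_index(361) -> (0.0, 1.0)
# decode_angle_index(370) -> (9.0, 1.0)
```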
- a location of the point is determined. Typically, this comprises determining the location of the point based on the location of the capture device, the capture angle, and a distance of the point from the capture device (where this distance is specified in the point data for the point).
- Determining the location of the point typically comprises determining the location of the point relative to a centrepoint of the three-dimensional representation. This location of the point may then be converted into a desired coordinate system and/or the point may be processed based on its location (e.g. to stitch together adjacent points).
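- As a sketch of this final determination (Python), the capture device location, the decoded azimuth and elevation, and the signalled distance can be combined into a point location; the axis convention used here (azimuth about the vertical axis, elevation above the horizontal plane) is an assumption for illustration, not a convention mandated by the disclosure.

```python
# Sketch: (capture device location, azimuth, elevation, distance) -> point
# location in the shared coordinate system.
import math

def point_location(device_xyz, azimuth_deg, elevation_deg, distance):
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    dx = distance * math.cos(el) * math.cos(az)
    dy = distance * math.cos(el) * math.sin(az)
    dz = distance * math.sin(el)
    return (device_xyz[0] + dx, device_xyz[1] + dy, device_xyz[2] + dz)
```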
- the angular identifier typically comprises a first angular identifier and a second angular identifier, where the first identifier provides the azimuthal angle of the point and the second identifier provides the elevation angle of the point.
- each angular identifier may be provided as an index of a segment of the three- dimensional representation, where, for example, an index of 0 may identify the point as being in a first angular bracket 101 and an index of 1 may identify the point as being in a second angular bracket 102.
- the capture devices are arranged to perform a capture process, e.g. as described with reference to Figure 3, with a non-infinite angular resolution.
- each point is not a one-dimensional point located at a precise angle. Instead, each point is a point for a particular area of space, with the size of this area being dependent on the angular resolution as well as the distance of the point from the capture device.
- each capture angle determines a point for an angular range (with the range being dependent on the angular resolution).
- FIG 9 shows a series of angular brackets, with the size of these angular brackets at a given distance being dependent on the angular resolution.
- the angular identifier(s) typically comprise a reference to such an angular bracket.
- each capture device has the same capture pattern so that the angular bracketing of each device is the same (albeit centred differently at the location of the relevant capture device).
- the angle for each bracket may be 360°/1000 (i.e. 0.36°).
- different capture devices are associated with different capture patterns, where this may be signalled in configuration information relating to the three-dimensional representation.
- each capture device is arranged to capture a point for a plurality of angular brackets, where each bracket is associated with a different angle.
- the angular spread of each bracket (that is, the angle between a first, e.g. left, angular boundary of the bracket and a second, e.g. right, angular boundary of the bracket) may be the same; equally, this angular spread may vary.
- the angular spread may vary so as to be smaller for points which are directly in front of (or behind, or to a side of) the capture device.
- the embodiment shown in Figure 10 shows an angular bracketing system that is based on a cube.
- a cube is placed such that a capture device is located at the centre of the cube and the cube is then split into 1000 sections of equal size (it will be appreciated that the use of 1000 sections is exemplary and any number of sections may be used). Each of these sections is then associated with an angular index. With this arrangement, the angular spread of each section (or bracket) varies, as has been described above.
- Figure 9 shows a two-dimensional square, where each angular bracket of the square is referenced by an index number (between 1 and 100).
- an angular bracket of a cube could be indicated with two separate numbers (with a first azimuthal indicator that identifies a ‘column’ of the cube and a second elevational indicator that identifies a ‘row’ of the cube).
- a singular indicator may be provided that indicates a specific bracket of the cube. Therefore, for a cube that is divided into 1000 elevational sections and 1000 azimuthal sections, the bracket may be indicated with two separate indicators that are each between 0 and 999 or with a single indicator that is between 0 and 999,999.
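- The equivalence between the two-indicator and single-indicator forms can be sketched as follows (Python; the section counts are the exemplary 1000 x 1000 values mentioned above and the 0-based ordering is an assumption).

```python
# Sketch of the two equivalent ways of indicating a bracket of the cube.
AZIMUTHAL_SECTIONS = 1000     # exemplary values from the text
ELEVATIONAL_SECTIONS = 1000

def combine(column, row):
    """(column, row), each 0..999 -> single indicator 0..999,999."""
    return row * AZIMUTHAL_SECTIONS + column

def split(indicator):
    """Single indicator 0..999,999 -> (column, row)."""
    return indicator % AZIMUTHAL_SECTIONS, indicator // AZIMUTHAL_SECTIONS
```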
- it will be appreciated that these brackets are exemplary and that other bracketing systems are possible.
- a spherical bracketing system may be used (where this leads to curved angular brackets).
- a lookup table may be provided that relates angular indexes to angles, where this enables irregularly spaced brackets to be used.
- determining the location of the point comprises determining the location of the point so as to be at the centre of the angular bracket identified by the angular identifier(s).
- the display device 17 is arranged to display one or more two-dimensional images (hereafter termed one or more two-dimensional ‘immersive’ images) to a user in order to provide the impression of a three-dimensional scene.
- the display device may provide a first ‘immersive’ image for a first eye of a user and a second (different) ‘immersive’ image for a second eye of a user with the differences between the images providing the impression of depth to a viewer of the images (‘immersive’ is used here as a label to distinguish the aforementioned displayed two-dimensional images from other two-dimensional images).
- the immersive images are typically arranged to be viewed by the display device 17 (e.g. by a VR headset).
- a computer device (e.g. the image generator 11 or the display device 17) is typically arranged to form the two-dimensional images based on a three-dimensional representation of the scene (e.g. based on a point cloud).
- the three-dimensional representation comprises a plurality of points, where each point has a location, an attribute value for a left eye, and an attribute value for a right eye.
- based on a position of a viewer (e.g. in the viewing zone), the computer device is able to identify the points of the three-dimensional representation that are visible to the user and to form the two-dimensional immersive images for each eye based on the attribute values of these points and the locations of these points relative to the viewer.
- an editor of the scene is typically able to: edit the three-dimensional representation (e.g. to change an attribute of a point); render an immersive two-dimensional image based on this three-dimensional representation; identify a change that they wish to make to the two-dimensional immersive image; modify the three-dimensional representation accordingly; and then render a further two-dimensional immersive image based on this modified three-dimensional representation to determine whether the modification has had the desired effect.
- This process may be performed whether or not the three-dimensional representation includes a two-dimensional background image.
- the present disclosure considers a method of partially rendering the two-dimensional immersive image in such a way that a user is able to evaluate the effect of modifications to the three-dimensional representation without the need to wholly re-render the two-dimensional immersive image.
- the method of forming the two-dimensional immersive image may comprise a three-step process in which, in a first step 51, a computer device (e.g. of the image generator 11) renders an intermediate two-dimensional immersive image; in a second step 52, the computer device identifies one or more rendering parameters; and in a third step 53, the computer device renders a final two-dimensional immersive image based on the intermediate image and the rendering parameters.
- the rendering of the intermediate image comprises rendering a two-dimensional image based on the locations of the points in the three-dimensional representation.
- This rendering may also comprise rendering the two-dimensional image based on the attributes of the points in the three-dimensional representation.
- the locations of the points in the three-dimensional representation may be fixed, with the attributes of these points still being modifiable.
- this may comprise maintaining a file that identifies the points of the three-dimensional representation that relate to the pixels or objects in the two-dimensional image.
- a user is then able to edit the two-dimensional image, e.g. to modify a colour gamut of the image, to add shadows or effects, to alter the lighting properties, where the effect of these edits on the various pixels of the two-dimensional image can be identified and related to changes in the attribute values of the three- dimensional representation.
- the computer device may identify an edit made to the two-dimensional immersive image and then identify a change that would be required to a point of the three-dimensional representation in order to effect this change in a future rendering of a two-dimensional immersive image.
- a user may decrease a brightness of the two-dimensional image by 10%, and the computer device may identify that each point of the three-dimensional representation should similarly be reduced in brightness by 10% so that a future two-dimensional immersive image (that is rendered based on the three-dimensional representation) will correspond to the edited two-dimensional representation.
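- A minimal sketch of propagating such an edit back to the point data is given below (Python; the attribute field names and the list-of-channels layout are assumptions, and a 10% brightness reduction corresponds to a factor of 0.9).

```python
# Illustrative sketch: a relative edit identified on the rendered
# two-dimensional image (here a 10% brightness reduction) is applied to the
# per-eye attribute values of the points so that future renders match.
def propagate_brightness_edit(points, factor=0.9):
    for point in points:
        point["attribute_left"] = [c * factor for c in point["attribute_left"]]
        point["attribute_right"] = [c * factor for c in point["attribute_right"]]
    return points
```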
- a first rendered image may be formed based on the three-dimensional representation and a second rendered image may be formed based on these background image(s). These rendered images may then be combined in the compositing step. Modifications may then be made to the rendered images separately.
- the rendering of the second rendered image (based on the background image) is much quicker than the rendering of the first rendered image and so this enables a user to modify and potentially re-render the second rendered image without needing to re-render the first rendered image.
- the method described above enables a user to evaluate the effect of changes to the scene without entirely re-rendering a video of the scene (which video typically comprises a plurality of two-dimensional immersive images).
- the rendering parameters may comprise the attribute values of one or more points.
- the rendering parameters may, additionally or alternatively, comprise one or more two-dimensional effects, which two-dimensional effects may comprise filters or distortions that are applied to the intermediate two-dimensional immersive image (e.g. to change a brightness, contrast, or colour gamut of the intermediate two-dimensional immersive image).
- In a first ‘authoring’ step 61, the computer device prepares the scene files required to render the scene. This typically involves the computer device identifying the three-dimensional representations that will be used to form the two-dimensional immersive images.
- the authoring step 61 generally involves the creation and assembly of content that is intended for final output, encompassing a broad range of activities from initial design and development to final adjustments before the content is rendered and composited.
- the authoring step may involve generating and/or modifying a three-dimensional representation prior to the creation of one or more two-dimensional immersive images based on this three-dimensional representation.
- In a second ‘rendering’ step 62, the computer device creates two-dimensional images based on the scene files, which images can be provided to, or edited by, a user. It will be appreciated that these two-dimensional images may be presented on a display in such a way that they form the impression of a three-dimensional scene when viewed by a user.
- the rendering step 62 may comprise generating a plurality of different images. For example, this step may comprise generating a first image based on the points of a three-dimensional representation and a second image based on a background image that is referenced within this three-dimensional representation or that is selected by a user.
- the rendering step 62 may involve calculating light, shadow, texture, and colour information to create individual image frames from the three-dimensional representation. For example, a user may provide an input that identifies a light source and the computer device may process the three-dimensional representation based on this light source (e.g. by modifying the attribute values of points in the three-dimensional representation based on the light source).
- the rendering step can be time-consuming and resource-intensive since it often requires complex calculations to simulate realistic lighting and materials.
- a user may wish to evaluate the effects of various filters or inputs on the three- dimensional representation in order to form the scene.
- Such an evaluation may require re-rendering images based on a variety of inputs; however, this can take a prohibitive amount of time, especially if a user is making a number of small tweaks to an input in order to perfect a rendered image.
- In a third ‘compositing’ step 63, the computer device combines the rendered images.
- the computer device may impose a foreground image (e.g. formed from a three-dimensional representation) onto a background image (e.g. a two-dimensional background image).
- the compositing step 63 typically involves combining a plurality of layers, where this combining may involve modifying the layers so as to blend the layers together. For example, a user may be able to define a contrast between various layers or may be able to alter a brightness or colour gamut of a layer in order to generate a final two-dimensional immersive image in which the layers are blended together in a desired way.
- the compositing step 63 typically involves compositing images formed from volumetric (e.g. three-dimensional) representations. Such compositing is different from film compositing that only considers two-dimensional images.
- the compositing step 63 of the method of Figure 14 typically comprises integrating and aligning multiple three-dimensional elements within a virtual space. This may involve combining three-dimensional representations of real-world objects, CGI elements, and other data sources to create a unified 3D environment. Compositing in this context may involve spatial adjustments, ensuring that objects from different captures or renders correctly interact in terms of scale, lighting, and positional accuracy. Typically, this involves forming two-dimensional objects (e.g. pixels) based on three-dimensional representations and then combining these two-dimensional objects based on three-dimensional location information associated with the three-dimensional objects. More specifically, this may comprise generating a plurality of two-dimensional layers, where each layer is associated with a different depth in the three-dimensional representation, and then combining the layers so as to provide a two-dimensional image that can represent a three-dimensional scene.
- the relationship between the rendering step 62 and the compositing step 63 is typically sequential. That is, the method typically comprises rendering individual elements (e.g. characters, backgrounds, visual effects) in the rendering step and then, following this rendering step, combining the rendered elements in the compositing step. In the compositing step, the rendered elements may be layered and adjusted to ensure that they interact convincingly with each other. Compositing allows for the integration of various visual elements into one seamless visual output, adjusting for factors like depth, colour balance, and interaction of light among different components.
- the user may desire to modify the three-dimensional representation, for example to alter a lighting of the three-dimensional representation or to alter a colour of a point of the three-dimensional representation.
- a user might wish to view and/or modify two-dimensional images that are generated based on the three-dimensional representation in such a way that any modifications made to the two-dimensional images cause modifications to the three-dimensional representation. For example, if a user edits a colour or a position of a visual element in one of the aforementioned composite images and decides that they prefer the new colour of the element, they may wish to similarly edit the colour of any associated points in the three-dimensional representation so that a future rendering of the three-dimensional representation will cause the element to appear with the edited colour.
- a modifier (e.g. a compositor) may wish to edit the three-dimensional representation before this three-dimensional representation is transmitted to viewers (which viewers will then render the three-dimensional representation for viewing).
- the compositor may wish to perform this modification on a rendered and/or composited image, since a two-dimensional image is easier to view and modify than the three-dimensional representation, but the compositor may then wish to modify the three-dimensional representation. This may require the compositor to: identify a desired change based on a composite image; enact a corresponding modification in the three-dimensional representation; form a new composite image based on the updated three-dimensional representation; and then ensure that the enacted modification has had the desired effect.
- the compositor may be required to re-start the process of Figure 14 and to make modifications to a three-dimensional representation during a new authoring step; since this process then involves rerendering images based on the three-dimensional representation, this process can be lengthy, unwieldy, and unpredictable (since it might be difficult for the compositor to determine which modifications they must make to the three-dimensional representation to achieve a desired effect in the two-dimensional images). Therefore, the present disclosure describes methods by which a three-dimensional representation can be more efficiently modified.
- the user is instead able to edit the lighting in the scene using post-processing software that considers a two-dimensional image.
- This modification process enables a user to modify the points of the three-dimensional representation, so that a complete re-generation of the three-dimensional representation is not necessary.
- the method of Figure 11 may comprise associating one or more pixels of a rendered or composite image with such an arbitrary output value. This typically occurs following the second rendering step 62, where pixels from one or more rendered images are then associated with one or more AOVs. The third compositing step 63 is then dependent on the values of these AOVs.
- AOVs help in troubleshooting and quality control by enabling an editor to isolate specific parts of a rendered image. Therefore, if a compositor identifies an issue with how light interacts with the surface of an object, this compositor can use a relevant AOV to analyse and correct this problem without needing to modify the original 3D model and to re-generate the three-dimensional representation.
- AOVs enable a user to modify a rendered image efficiently and in detail (e.g. focusing on a particular part of the image) and to view the effects of these modifications without needing to modify the original 3D model and to re-generate the three-dimensional representation.
- a computer device might provide separate AOVs for one or more points of a three-dimensional representation or one or more pixels of a rendered two-dimensional image, the AOVs providing information that defines one or more of: diffuse (the basic colour of surfaces without reflections or lighting); specular (the reflective highlights from surfaces); normals (information about surface angles, useful for relighting or adjusting effects based on the orientation of surfaces); and depth (how far elements are from the camera, this may be used for depth of field effects or atmospheric effects).
- the various AOVs may be combined in order to form the final image that appears in, e.g., a film or a game. Therefore, a user is able to separately combine AOVs in order to modify specific features of an image.
- the second rendering step 62 comprises rendering 71 one or more two-dimensional layers (e.g. images) from a three-dimensional representation and associating 72 pixels from one or more of these layers with AOVs.
- a plurality of layers (of pixels) may be generated based on different sets of points in the three-dimensional scene, which sets of points are each associated with a different distance from a viewing zone.
- a first layer may be generated based on points that are in the range of 10 m to 15 m from the viewing zone and a second layer may be generated based on points that are in the range of 15 m to 25 m from the viewing zone, etc. It will be appreciated that these distances are purely exemplary. The range of distances associated with each layer may depend on a distance of the layer from the viewing zone (e.g. where layers associated with points far from the viewing zone are associated with larger distance ranges than layers associated with points closer to the viewing zone).
- a user is then able to separately combine and/or modify 73 the AOVs for these various layers so that, for example, a user can edit the appearance of objects close to the viewing zone without affecting the appearance of objects further from the viewing zone.
- the computer device can then identify 74 a layer that is associated with the combined AOVs and can identify the points of the three-dimensional representation that are associated with this layer.
- the computer device may then modify 75 the one or more identified points of the three-dimensional representation based on the modifications made to the rendered layers. For example, if a characteristic (e.g. a colour) of a first layer is modified by a user, then the computer device may be arranged to identify one or more points associated with the first layer and to modify the attributes of the identified points based on the modifications made to the rendered layers.
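- The following sketch (Python; the bin edges, field names, and the edit callback are illustrative assumptions) shows the idea of binning points into distance-based layers and applying a modification only to the points behind the edited layer.

```python
# Sketch: points are binned into layers by distance from the viewing zone
# and an edit is applied only to the points of the chosen layer.
LAYER_RANGES = [(0, 10), (10, 15), (15, 25), (25, float("inf"))]  # metres

def layer_of(distance):
    for i, (near, far) in enumerate(LAYER_RANGES):
        if near <= distance < far:
            return i
    return len(LAYER_RANGES) - 1

def modify_layer(points, layer_index, edit):
    """Apply `edit` (a function acting on a point) only to points whose
    distance from the viewing zone places them in the chosen layer."""
    for point in points:
        if layer_of(point["distance_from_zone"]) == layer_index:
            edit(point)
    return points
```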
- the computer device is arranged to render one or more two-dimensional objects (e.g. pixels) based on points of the three-dimensional representation; to store a correspondence between the two-dimensional objects and the points; to associate one or more AOVs with the two-dimensional objects; to identify a modification to one or more of the AOVs; and, based on the correspondences and the identified modifications, to modify the corresponding points of the three-dimensional representation.
- This method may be considered to involve generating a ‘proxy’ two-dimensional image based on a three-dimensional representation.
- the proxy image may, for example, comprise a 360 degree image of the three-dimensional representation.
- a user is then able to edit the proxy image using two-dimensional image processing software (e.g. by altering an AOV associated with a point of the image).
- the three-dimensional representation can then be updated based on these edits.
- a correspondence between points of the three-dimensional representation and the pixels of the two-dimensional image is determined at the time of generation of the two-dimensional image. This correspondence is then used to identify a point of the three-dimensional representation that is associated with each edited pixel of the two-dimensional image.
- the points of the three-dimensional representation can then be modified based on edits made to the pixels of the two-dimensional proxy image.
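- A sketch of this proxy-image workflow is given below (Python; the projection callback, field names, and dictionary-based edit list are assumptions): the pixel-to-point correspondence is recorded when the proxy image is generated and is later used to map pixel edits back onto the points.

```python
# Sketch of the proxy-image workflow: store a per-pixel correspondence to
# point indices at render time, then map pixel edits back onto the points.
def render_proxy(points, width, height, project):
    """`project` maps a point to a pixel (u, v)."""
    image = [[(0, 0, 0)] * width for _ in range(height)]
    correspondence = [[None] * width for _ in range(height)]
    for index, point in enumerate(points):
        u, v = project(point)
        if 0 <= u < width and 0 <= v < height:
            image[v][u] = point["colour"]
            correspondence[v][u] = index
    return image, correspondence

def apply_pixel_edits(points, correspondence, edited_pixels):
    """edited_pixels: {(u, v): new_colour} coming from the 2D editor."""
    for (u, v), colour in edited_pixels.items():
        index = correspondence[v][u]
        if index is not None:
            points[index]["colour"] = colour
    return points
```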
- Referring to Figure 13, there is described another method by which a three-dimensional representation can be modified using compositing software. This process is typically performed by a computer device (based on user inputs). The method of Figure 16 enables a user to work directly (or near-directly) on the point data of the three-dimensional representation.
- the entries of an array are each associated with one or more AOVs.
- the entries of the array may each be associated with one or more AOVs that indicate attribute data of the points.
- the aforementioned point information may be associated with the two-dimensional array via these AOVs. Therefore, for example, one or more of: an indicator of a capture device used to capture the point; location information (e.g. in x, y, z coordinates or with reference to the capture device); a size; attribute values; a normal; and a transparency of each point may be indicated by an associated AOV. In this way, information relating to the three-dimensional representation can be effectively signalled in a two-dimensional image.
- the computer device may generate one or more AOVs that relate to one or more components of point data associated with the points of the three-dimensional representation.
- the point data typically comprises a plurality of fields (e.g. a normal field, a left eye attribute field, a right eye attribute field, etc.) and an AOV may be generated for one or more of, or each of, the fields of the point data. Therefore, the AOV provides a direct link to the point data of the three-dimensional representation.
- a user is able to modify the attributes of the two-dimensional array representation using (e.g. conventional) compositing software. As described below, this two-dimensional array can then be “transformed” (quickly) into a 360 degree image and/or a two-dimensional image so that the user is able to quickly visualise the effect of these modifications.
- This two-dimensional array may then be used as an input to two-dimensional editing software by providing the array as an input ‘image’.
- This two-dimensional array typically does not form a coherent image, so that while the array may be input to the two-dimensional image software to enable a user to modify the array, the two-dimensional array is typically not able to be directly displayed as a coherent image.
- the computer device can then transform the entries of the two-dimensional array based on the technical AOVs in order to present a coherent two-dimensional image to a user.
- the user may be able to generate or edit (e.g. parameters of) one or more of a second set of AOVs (hereafter referred to as ‘editable’ or ‘standard compositing’ AOVs) and/or a user may be able to edit a colour of the entries in the array.
- the standard compositing AOVs may relate to a diffuse value or a specular value associated with an entry of the array.
- a reason for providing location information in the technical AOVs is that it enables three-dimensional location information to effectively be encoded in a two-dimensional data format. Transforming the three-dimensional points into an accurate two-dimensional visualisation image requires a transformation to occur, so that if a transformation module is positioned before an editing module then this editing module will necessarily be working on transformed data and not on the three-dimensional point data.
- By using the technical AOVs and locating a transformation module after an editing module, a user is able to work directly on three-dimensional point data while still being shown an accurate two-dimensional image.
- the computer device identifies one or more modifications to the point data (and/or the entries of the two-dimensional array).
- In a fifth step 85, the computer device re-renders the image (e.g. re-transforms the entries of the two-dimensional array) based on the modification and the technical AOVs. This step may occur following a user input (e.g. to confirm a change made to a rendered image).
- a user is able to directly modify the points of the three-dimensional representation by modifying a two-dimensional array that is transformed for the user into a two-dimensional visualisation image, so that a user can readily identify the result of their modifications. That is, since there is a direct correspondence between the technical AOVs and the fields of the points of the three-dimensional representation that define the positions of these points in the three-dimensional space, the computer device is able to transform the entries of the array into a two-dimensional visualisation image such that any modification that is made to the entries is readily associated with points of the two-dimensional image (and vice versa).
- the reader module 91 also generates (or interprets) a two-dimensional array (e.g. an array of entries or pixels), where each entry in the array is associated with a set of one or more technical AOVs.
- a three-dimensional point is stored (or interpreted) in the form of a single value in the array (e.g. a colour value) and a set of technical AOVs.
- the three-dimensional point may also be associated with a set of ‘editable’ or ‘standard compositing’ AOVs that define, e.g., diffuse, specular, etc. for the entries of the array.
- the technical AOVs are then transferred from the technical AOV storage module 92 to a transformation module 93, which transformation module uses the technical AOVs to render a two-dimensional visualisation image based on the two-dimensional array.
- the transformation module 93 is arranged to position the points of the two-dimensional array based on the technical AOVs in order to form the two-dimensional visualisation image of a scene. This two-dimensional visualisation image can then be shown to a user.
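- The reader/transformation split described above can be sketched as follows (Python; the field names, the choice of technical AOV channels, and the placement callback are assumptions): each array entry carries one visible value plus technical AOVs recording where its point sits, and the transformation step uses those AOVs to build the coherent visualisation image.

```python
# Sketch of the reader module and transformation module.
def read_points_to_array(points):
    """Reader module: one array entry per point, in storage order (this
    array is generally not a coherent image on its own)."""
    return [{"colour": p["colour"],
             "technical": {"capture_device": p["capture_device"],
                           "angle_index": p["angle_index"],
                           "distance": p["distance"]}}
            for p in points]

def transform_to_visualisation(array, width, height, place):
    """Transformation module: `place` maps the technical AOVs of an entry
    to a pixel (u, v) so that edits on the array can be previewed."""
    image = [[(0, 0, 0)] * width for _ in range(height)]
    for entry in array:
        u, v = place(entry["technical"])
        if 0 <= u < width and 0 <= v < height:
            image[v][u] = entry["colour"]
    return image
```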
- the editing module is arranged to detect each modification that is made using the editing module and to cause the re-transformation of the two-dimensional visualisation image (via the transformation module 93) based on these modifications.
- the image is re-transformed following each modification.
- the image may be re-transformed periodically and/or following a number of modifications.
- the image may be re-transformed based on a user input.
- the input to the editing module 94 is a two-dimensional array (e.g. a two-dimensional image), where the entries of this array (e.g. the pixels of the image) are associated with AOVs.
- This is a typical format for data in two-dimensional image editing software.
- the use of the technical AOVs effectively enables the system to encode information that enables the correct position of three-dimensional points in this conventional two-dimensional format.
- the two-dimensional array that is passed to the editing module 94 may be determined by parsing points of the three-dimensional representation and this may result in a non-representative or incoherent two-dimensional image.
- the location, transparency, etc. of the points are typically stored as the technical AOVs, so the pixels of the two-dimensional image formed by the entries of the array may essentially be a random (or at least non-meaningful) arrangement of values.
- the points may be grouped based on the capture device used to capture these points and so the pixels may be grouped similarly. Due to this, the two-dimensional array formed in the first step 81 is typically not visually meaningful to a user.
- the method may include transforming the two-dimensional array (using the transformation module 93) based on the technical AOV values of the entries in the array in order to provide a visually-accurate rendering of the scene to a user (e.g. using an X-Y transformation projection).
- the computer device may be arranged to determine a transformation of the two-dimensional array based on the AOV values of the entries of the array, e.g. the locations and attributes of the entries that are indicated by the AOV values, in order to provide a two-dimensional representation (the two-dimensional visualisation image) of the scene to a user.
- This method typically requires some up-front processing, where a suitable set of technical AOVs (and/or a suitable transformation) must be generated that translates the pixels in the two-dimensional array into a visually meaningful image based on the technical AOV values of those pixels (which AOV values identify locations and attribute values of the pixels).
- a user is able to work directly (or near-directly) on the points from the three-dimensional representation and to directly modify the three-dimensional points.
- a user is typically able to modify (e.g. manipulate and define) the final colour of the image using “standard compositing” AOVs associated with the entries of the two-dimensional array in order to enact a desired change to an image, and the computer device is then able to modify the point data of a corresponding point of the three-dimensional representation based on a correspondence between the entries of the array (and the standard compositing AOVs) and the points of the three-dimensional representation.
- a user may directly edit this colour (e.g. in the three-dimensional representation) or they may define a new colour for the point based on the “standard compositing” AOVs of this point and the computer device may then modify the attribute data of this point.
- the user may edit the attribute data (e.g. the colour data) of the point directly, and the computer device may then update the corresponding entry in the array based on the edited point data in order to update an image being viewed by a user.
- while modifications have primarily been described with reference to modifying point colour, a user may equally be able to modify the technical AOVs as well as the “standard compositing” AOVs.
- Such embodiments enable a user to work directly on the point data of the three-dimensional representation, with the modifications being reflected (via the transformations) into a meaningful image in near-real time.
- the processing of a three-dimensional representation results in a processed three-dimensional representation that may then be stored and/or transmitted by a computer device.
- the processed three-dimensional representation may be encoded in a bitstream that is transmitted to another device and/or the processed three-dimensional representation may be used to render one or more two-dimensional images, which images may be encoded in a bitstream that is transmitted to another device.
- This bitstream can then be decoded by this other device in order to extract the processed three-dimensional representation and/or the two-dimensional image(s) from the bitstream.
- the present disclosure envisages a bitstream that contains or references a background video, where this background video is associated with one or more background points of the three-dimensional representation.
- the bitstream may comprise a first section that defines a plurality of points of a three-dimensional representation (including one or more background points) and a second section that defines, or references, one or more two-dimensional background images that are associated with these points and/or that reference video data associated with these points.
- Figure 15 shows a schematic of such a bitstream comprising two sections, where each section comprises one or more bits.
- Bit-a to Bit-d forms a first section of the bitstream that signals one or more points of a three-dimensional representation and/or that defines one or more immersive two-dimensional images.
- Bit-e to Bit-f forms a second section of the bitstream that references or defines one or more two-dimensional background images (or videos) that are referenced by the points of the three-dimensional representation.
- the second section may be encoded/decoded using traditional two-dimensional video coding techniques (such as HEVC, VVC, and/or LCEVC techniques), while the first section may be encoded/decoded using a three- dimensional encoding/decoding technique.
- while the bitstream is typically encoded in the order of the sections provided above, the bitstream may be encoded in any order.
- bits from the first section and the second section may be ‘interlaced’, where any background point that is signalled in the bitstream is immediately followed by a corresponding background video.
- the bitstream described above may be decoded by a decoding device and this may allow the original (or similar to the original) plurality of images to be re-generated.
- a method of decoding said bitstream may comprise the steps of: identifying a first and a second section of bits in a bitstream; generating a plurality of initial two-dimensional immersive images based on the bits in the first section of the bitstream; generating one or more two-dimensional background images based on the bits in the second section of the bitstream; and combining the initial immersive and background images to form one or more final two-dimensional immersive images.
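- Purely as an illustration of the decoding flow (the actual bitstream layout is not specified here), the sketch below assumes each section is preceded by a 4-byte big-endian length; the helper name and the length-prefix convention are assumptions.

```python
# Hypothetical sketch of splitting such a bitstream into its two sections,
# assuming (for illustration only) length-prefixed sections.
import struct

def split_sections(bitstream: bytes):
    offset = 0
    sections = []
    for _ in range(2):
        (length,) = struct.unpack_from(">I", bitstream, offset)
        offset += 4
        sections.append(bitstream[offset:offset + length])
        offset += length
    first_section, second_section = sections   # 3D points / background video
    return first_section, second_section
```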
- the representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render an XR video).
- the term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR) and it will be appreciated that the disclosures herein are applicable to any of these technologies.
- the scene comprises a static scene; alternatively, in some embodiments the scene comprises a video and/or a moving (e.g. non-static) scene. That is, in some embodiments the scene comprises a static scene, such as a building, where a viewer is able to move through this scene, e.g. to view different rooms of the building, but where the scene itself does not change. In some embodiments, the scene comprises a moving scene, where elements of the scene vary in time even where the viewer remains stationary. It will be appreciated that typically the scene comprises both static and moving elements where, for example, non-static elements move in front of a static background.
Abstract
There is described a method of processing a three-dimensional representation of a scene, the method comprising: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
Description
Processing a three-dimensional representation of a scene
Field of the Disclosure
The present disclosure relates to methods, systems, and apparatuses for processing a three-dimensional representation of a scene.
Background to the Disclosure
Three-dimensional representations of environments are used in many contexts, including for the generation of virtual reality videos, in which depth information for a plurality of points of the representation is used to generate different images for a left eye and a right eye of a user. Typically, substantial processing power is required to determine such a three-dimensional representation, and the file size of files associated with these representations is typically large so that substantial amounts of storage are needed to keep the files and substantial amounts of bandwidth are required to transfer the files.
Summary of the Disclosure
According to another aspect of the present disclosure, there is described a method of processing a three-dimensional representation of a scene, the method comprising: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
Preferably, the method comprises identifying a modification during a compositing stage.
Preferably, the method comprises: for one or more of the points of the three-dimensional representation, generating a first set of arbitrary output values (AOVs); and rendering the one or more two-dimensional objects based on the first set of AOVs.
Preferably, the first set of AOVs defines location information of the two-dimensional objects. Preferably, the first set of AOVs is defined based on location information of the one or more points of the three-dimensional representation. Preferably, the first set of AOVs comprises one or more of: an AOV defining a normal value of a corresponding point; an AOV defining a capture device associated with a corresponding point; and an AOV defining a distance of a corresponding point from a capture device.
Preferably, the first set of AOVs is non-modifiable.
Preferably, identifying a modification of the objects comprises identifying a modification of a colour of an object and/or identifying a modification of an AOV associated with an object and/or a point. Preferably, the modified AOV is an editable AOV.
Preferably, the one or more points are associated with: a first set of AOVs that defines a location of the point, preferably wherein the first set of AOVs is non-modifiable; and a second set of AOVs that defines a colour of the point. Preferably, the second set of AOVs is modifiable.
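As a non-authoritative sketch of the two sets of AOVs just described, the following Python fragment keeps the first (location) set behind a read-only view while leaving the second (colour) set editable; the class and field names are illustrative assumptions rather than terms used by the disclosure.

```python
from dataclasses import dataclass, field
from types import MappingProxyType

@dataclass
class PointAOVs:
    # First set: location information (and e.g. normal, capture device, distance
    # from the capture device) -- treated as non-modifiable.
    _location_aovs: dict
    # Second set: colour and other editable attributes -- modifiable.
    colour_aovs: dict = field(default_factory=dict)

    @property
    def location_aovs(self):
        # Expose the first set through a read-only view so that editing or
        # compositing tools cannot modify it.
        return MappingProxyType(self._location_aovs)

point = PointAOVs(
    _location_aovs={"x": 1.0, "y": 2.0, "z": 0.5, "normal": (0.0, 0.0, 1.0)},
    colour_aovs={"rgb": (200, 180, 90)},
)
point.colour_aovs["rgb"] = (255, 0, 0)   # allowed: the second set is modifiable
# point.location_aovs["x"] = 3.0         # would raise TypeError: the first set is read-only
```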
Preferably, the method comprises: determining that a modification has been made to one or more of the objects; and re-rendering the two-dimensional objects based on the modification. Preferably, the method comprises re-rendering the two-dimensional objects based on the first set of AOVs for the one or more points of the three-dimensional representation.
Preferably, the method comprises: associating one or more arbitrary output values (AOVs) with the two-dimensional objects; identifying a modification of one or more of the AOVs; and identifying the two-dimensional objects that are associated with the modified AOVs.
Preferably, the method comprises modifying the identified points. Preferably, the method comprises modifying the identified points based on the modifications to the AOVs.
Preferably, the method comprises: storing (e.g. at the time of rendering the two-dimensional object) a correspondence between the two-dimensional objects and corresponding points of the three-dimensional representation; wherein identifying the points of the three-dimensional representation comprises identifying the points based on the correspondences.
Preferably, the method comprises: determining one or more datafields associated with the points; and generating one or more AOVs based on the values of the datafields. Preferably, the datafields define a location and/or one or more attributes of the points.
Preferably, the method comprises: generating a two-dimensional array based on the points, wherein each entry of the two-dimensional array is associated with a point of the three-dimensional representation; and associating each entry of the array with one or more AOVs, the AOVs representing attributes of the point associated with said entry.
Preferably, the AOVs indicate a location of each point in the three-dimensional representations.
Preferably, the AOVs indicate an attribute value of each point, preferably wherein the AOVs indicate one or more of: a normal; a transparency; a colour; a left eye attribute value; and a right eye attribute value.
Preferably, the method comprises determining a transformation that converts the two-dimensional array into a two-dimensional image that represents the scene. Preferably, the transformation is determined based on the values of the AOVs associated with each entry.
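A minimal sketch of such a two-dimensional array is given below, assuming points captured on a regular azimuth/elevation grid; the dictionary keys and the simple colour read-out used as the transformation are assumptions for illustration.

```python
def points_to_array(points, n_azimuth, n_elevation):
    """Arrange points into a two-dimensional array indexed by (elevation, azimuth);
    each entry holds the AOVs of the point captured at that angle pair."""
    array = [[None] * n_azimuth for _ in range(n_elevation)]
    for p in points:
        # Each point is assumed to carry the grid indices it was captured at.
        array[p["elev_idx"]][p["azim_idx"]] = {
            "location": p["location"],   # AOV indicating the point's position in the 3D representation
            "colour": p["colour"],       # attribute AOVs (colour here; normal, transparency, etc. could be added)
        }
    return array

def array_to_image(array, missing=(0, 0, 0)):
    """A simple transformation of the array into a two-dimensional image of the
    scene, here by reading out the colour AOV of every entry."""
    return [[entry["colour"] if entry else missing for entry in row] for row in array]

# Toy usage: two points on a 2x2 grid.
points = [
    {"elev_idx": 0, "azim_idx": 0, "location": (1.0, 0.0, 2.0), "colour": (255, 0, 0)},
    {"elev_idx": 1, "azim_idx": 1, "location": (0.0, 1.0, 3.0), "colour": (0, 255, 0)},
]
image = array_to_image(points_to_array(points, n_azimuth=2, n_elevation=2))
```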
Preferably, rendering the objects comprises: determining one or more scene files, wherein at least one scene file comprises a three-dimensional representation of the scene; and rendering one or more two-dimensional objects based on the scene files.
Preferably, the method comprises compositing the rendered two-dimensional objects to form a two-dimensional immersive image.
Preferably, the rendered two-dimensional objects comprise one or more of: one or more two-dimensional objects rendered based on the three-dimensional representation; and a two-dimensional background image.
Preferably, the compositing comprises superimposing a two-dimensional object rendered based on the three-dimensional representation onto a two-dimensional background image.
Preferably, the method comprises rendering a plurality of layers associated with the three-dimensional representation, wherein each layer comprises a two-dimensional image and wherein each layer is associated with one or more points of the three-dimensional representation that are at similar distances from a viewing zone of the three-dimensional representation.
Preferably, each layer is associated with a respective set of AOVs.
Preferably, the method comprises: identifying a modification to one or more AOVs; identifying one or more points of the three-dimensional representation that are associated with the AOVs; and modifying the points of the three-dimensional representation based on the identified modification to the one or more AOVs.
Preferably, the modification relates to one or more of: an attribute value; a colour; a location; a normal; and a transparency.
Preferably, identifying the modification to one or more AOVs comprises: identifying a modification to a value of a point of the three-dimensional representation; and updating a value of an AOV relating to this value based on said modification.
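One possible, purely illustrative way of propagating such an AOV modification back to the points is sketched below; the correspondence map (stored at render time) and the attribute names are assumptions.

```python
def apply_aov_modifications(points, correspondences, modified_aovs):
    """Propagate AOV edits made on rendered 2D objects back to the associated
    points of the 3D representation.

    points          -- dict: point id -> dict of point attributes
    correspondences -- dict: 2D object id -> point id (stored when rendering)
    modified_aovs   -- dict: 2D object id -> {aov_name: new_value}
    """
    for object_id, edits in modified_aovs.items():
        point_id = correspondences.get(object_id)
        if point_id is None:
            continue  # this 2D object has no associated point
        for aov_name, new_value in edits.items():
            # aov_name might be e.g. "colour", "transparency", "normal", or "location"
            points[point_id][aov_name] = new_value
    return points

points = {7: {"colour": (10, 10, 10), "transparency": 0.0}}
points = apply_aov_modifications(points, {"obj-A": 7}, {"obj-A": {"colour": (200, 40, 40)}})
```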
Preferably, the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene, preferably wherein the user is able to move within the viewing zone with six degrees of freedom (6DoF). Preferably, the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene. Preferably, the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³) and/or less than one-hundredth of a cubic metre (0.01 m³).
Preferably, the three-dimensional representation comprises a point cloud.
Preferably, the method comprises storing the three-dimensional representation and/or outputting the three-dimensional representation. Preferably, the method comprises outputting the three-dimensional representation to a further computer device.
Preferably, the method comprises generating an image and/or a video based on the three-dimensional representation.
Preferably, the method comprises forming one or more two-dimensional representations of the scene based on the three-dimensional representation. Preferably, the method comprises forming a two-dimensional representation for each eye of a viewer.
Preferably, the point is associated with one or more of: a location; an attribute; a transparency; a colour; and a size.
Preferably, the point is associated with an attribute for a right eye and an attribute for a left eye.
Preferably, the scene comprises one or more of: an extended reality (XR) scene; a virtual reality (VR) scene; an augmented reality (AR) scene; and a mixed reality (MR) scene.
Preferably, the method comprises forming a bitstream that includes the point.
According to another aspect of the present disclosure, there is described a system for carrying out the aforesaid method, the system comprising one or more of: a processor; a communication interface; and a display.
According to another aspect of the present disclosure, there is described an apparatus for processing a three-dimensional representation of a scene, the apparatus comprising: means for (e.g. a processor for) rendering one or more two-dimensional objects based on points of the three-dimensional representation; means for (e.g. a processor for) identifying a modification of one or more of the objects; means for (e.g. a processor for) identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and means for (e.g. a processor for) outputting the modifications and the identified points of the three-dimensional representation.
According to another aspect of the present disclosure, there is described a bitstream comprising one or more points modified using the aforesaid method.
According to another aspect of the present disclosure, there is described an apparatus (e.g. an encoder) for forming and/or encoding the aforesaid bitstream.
According to another aspect of the present disclosure, there is described an apparatus (e.g. a decoder) for receiving and/or decoding the aforesaid bitstream.
According to another aspect of the present disclosure, there is described a system for processing a three-dimensional representation of a scene, the system comprising: a viewer module for: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; and re-rendering the two-dimensional objects based on the modification; and an editing module for: identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
Preferably, the system comprises a reader module for: determining, for one or more points of the three-dimensional representation, a first set of arbitrary output values (AOVs), the first set of AOVs defining location information for the one or more points, wherein the viewer module is arranged to render the two-dimensional objects based on the first set of AOVs for the points.
Any feature in one aspect of the disclosure may be applied to other aspects of the invention, in any appropriate combination. In particular, method aspects may be applied to apparatus aspects, and vice versa.
Furthermore, features implemented in hardware may be implemented in software, and vice versa. Any reference to software and hardware features herein should be construed accordingly.
Any apparatus feature as described herein may also be provided as a method feature, and vice versa. As used herein, means plus function features may be expressed alternatively in terms of their corresponding structure, such as a suitably programmed processor and associated memory.
It should also be appreciated that particular combinations of the various features described and defined in any aspects of the disclosure can be implemented and/or supplied and/or used independently.
The disclosure also provides a computer program and a computer program product comprising software code adapted, when executed on a data processing apparatus, to perform any of the methods described herein, including any or all of their component steps.
The disclosure also provides a computer program and a computer program product comprising software code which, when executed on a data processing apparatus, comprises any of the apparatus features described herein.
The disclosure also provides a computer program and a computer program product having an operating system which supports a computer program for carrying out any of the methods described herein and/or for embodying any of the apparatus features described herein.
The disclosure also provides a computer readable medium having stored thereon the computer program as aforesaid.
The disclosure also provides a signal carrying the computer program as aforesaid, and a method of transmitting such a signal.
The disclosure extends to methods and/or apparatus substantially as herein described with reference to the accompanying drawings.
The disclosure will now be described, by way of example, with reference to the accompanying drawings.
Description of the Drawings
Figure 1 shows a system for generating a sequence of images.
Figure 2 shows a computer device on which components of the system of Figure 1 may be implemented.
Figure 3 shows a method of determining a three-dimensional representation of a scene.
Figures 4a and 4b show a method of determining a point based on a plurality of sub-points.
Figure 5 shows a scene comprising a viewing zone.
Figures 6a and 6b show arrangements of capture devices for determining points of the three-dimensional representation.
Figure 7 shows different versions of a point that may be captured by different capture devices.
Figures 8a and 8b show grids formed by the different capture devices.
Figure 9 shows an arrangement for indicating an angle of a point from a capture device used to capture the point.
Figure 10 shows a method of rendering a final image based on an intermediate image.
Figure 11 shows a method of generating a composite image.
Figures 12 and 13 show a method of modifying a three-dimensional representation based on one or more arbitrary output values (AOVs).
Figure 14 shows modules of an image processing system.
Figure 15 shows a bitstream.
Description of the Preferred Embodiments
Referring to Figure 1, there is shown a system for generating a sequence of images. This system can be used to generate, and then display, a representation of an environment, which may comprise a VR environment (or an XR environment).
The system comprises an image generator 11, an encoder 12, a transmitter 13, a network 14, a receiver 15, a decoder 16 and a display device 17.
These components may each be implemented on separate apparatuses. Equally, various combinations of these components may be implemented on a shared apparatus; for example, the image generator 11, the encoder 12, and the transmitter 13 may all be part of a single image data generation device. Similarly, the receiver 15, the decoder 16, and the display device 17 may all be a part of a single image rendering device.
Typically, the system comprises at least one encoding computer device (e.g. a server of a content provider) and at least one rendering computer device (e.g. a VR headset).
Referring to Figure 2, each of the components, and in particular the image generator 11, the encoder 12, the transmitter 13, the receiver 15, the decoder 16 and the display device 17, is typically implemented on a computer device 20, where, as described above, a plurality of these components may be implemented on a shared computer device.
Each computer device comprises one or more of: a processor 21 for executing instructions (e.g. so as to perform one or more of the steps of the various methods described below), a communication interface 22 for facilitating communication between computer devices (e.g. an Ethernet interface, a Bluetooth® interface, or a universal serial bus (USB) interface), a memory 23 and/or storage 24 for storing information and instructions (e.g. a random access memory (RAM), a read only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), and/or a flash memory), and a user interface 25 (e.g. a display, a mouse, and/or a keyboard) for enabling a user to interact with the computer device. These components may be coupled to one another by a bus 25 of the computer device.
The computer device 20 may comprise further (or fewer) components. In particular, the computer device (e.g. the display device 17) may comprise one or more sensors, such as an accelerometer, a GPS sensor, or a light sensor. These sensors typically enable the computer device to identify an environmental condition and/or an action of a wearer of the display device.
Turning back to Figure 1 , the image generator 11 is configured to generate a sequence of image data (e.g. a sequence of image frames) to enable the display device 17 to use this image data to display a plurality of images. The image data may comprise one or more digital objects and the image data may be generated or encoded in any format. For example, the image data may comprise point cloud data, where each point
has a 3D position and one or more attributes. These attributes may, for example, include a surface colour, a transparency value, an object size, and a surface normal direction. Each attribute may have a value chosen from a continuous range or may have a value chosen from a discrete set.
The image data enables the later rendering of images. This image data may enable a direct rendering (e.g. the image data may directly represent an image). Equally, the image data may require further processing in order to enable rendering. For example, the image data may comprise three-dimensional point cloud data, where rendering a two-dimensional image using this data requires processing based on a viewpoint of this two-dimensional image.
The image data may comprise depth map data, where one or more pixels or objects in the image is associated with a depth that is specified by the depth map data. The depth map data may be provided as a depth map layer, separate from an image layer. In some contexts, such as MPEG Immersive Video (MIV), the image layer may instead be described as a texture layer. Similarly, in some contexts, the depth map layer may instead be described as a geometry layer.
The image data may include a predicted display window location. The predicted display window location may indicate a portion of an image that is likely to be displayed by the display device 17. The predicted display window location may be based on a viewing position of the user (such as a virtual position and/or orientation of the user in a 3D environment), where this viewing position may be obtained from the display device. The predicted display window location may be defined using one or more coordinates. For example, the predicted display window location may be defined using the coordinates of a corner or centre of a predicted display window, and may be defined using a size of the predicted display window. The predicted display window location may be encoded as part of metadata included with the frame.
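Purely as an illustration of how a predicted display window location might be carried as frame metadata, the following sketch uses a corner coordinate plus a size; the field names and the JSON serialisation are assumptions, not a defined metadata format.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class PredictedDisplayWindow:
    # Top-left corner of the predicted window, in pixels of the full image.
    x: int
    y: int
    # Size of the predicted window.
    width: int
    height: int

def frame_metadata(window: PredictedDisplayWindow) -> bytes:
    # The window is carried as part of per-frame metadata; JSON is used here
    # purely for illustration.
    return json.dumps({"predicted_display_window": asdict(window)}).encode()

meta = frame_metadata(PredictedDisplayWindow(x=512, y=128, width=1920, height=1080))
```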
The image data for each image (e.g. each frame) may include further information, which may be provided as a part of an image, e.g. as part of the point cloud data, or as separate layers. In particular, the image data may include audio information or haptic feedback information indicating audio or haptics which can accompany displayed visual data. An audio layer or haptic layer may accompany each image, and may be omitted for images where no accompanying audio or haptics are required.
Similarly, the image data may comprise interactivity information, where the image data may contain or indicate elements with which a user can interact. The interactivity information may, for example, define a behaviour of an element, where a user is able to interact with the element based on this behaviour. The behaviour typically defines a change in an element that occurs as a result of a user interaction where this change may comprise a change in the attributes of the element or in the rendering of the element. As an example, where an image contains a target element, the target element may be arranged to disappear when a user interacts with this element, or to provide feedback indicating that the user has interacted with the target. This interactivity data may be provided as part of, or separately to, the image data.
The image data may indicate, or may be combinable with, a state of the virtual environment, a position of a user, or a viewing direction of the user. Here, the position and viewing direction may be physical properties of the user in the real world, or the position and viewing direction may be purely virtual, for example being controlled using a handheld controller. The image generator 11 may, for example, obtain information from the display device 17 that indicates the position, viewing direction, or motion of the user. Equally, the image generator may generate image data such that it can later be combined with this position, viewing direction, or motion, where the image generator may generate a full scene which is only partially viewed by a user depending on the position of that user.
In some cases, the generated image may be independent of user position and viewing direction. This type of image generation typically requires significant computer resources such as a powerful GPU, and may be implemented in a cloud service, or on a local but powerful computer. For example, a cloud service (such as a content rendering network (CRN)) may reduce the cost per user and thereby make the image frame
generation more accessible to a wider range of users. Here “rendering” refers at least to an initial stage of rendering to generate an image. Further rendering may occur at the display device 17 based on the generated image to produce a final image which is displayed.
The image generator 11 may, for example, comprise a rendering engine for initially rendering a virtual environment such as a game or a virtual meeting room.
The encoder 12 is configured to encode frames to be transmitted to the display device 17. The encoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC. In some embodiments, the image generator 11 may transmit raw, unencoded, data through the network 14. However, such transmission typically leads to a high file size and requires a high bandwidth so that it is typically desirable to encode the data prior to the transmission.
The encoder 12 may encode the image data in a lossless manner or may encode the data in a lossy manner. The encoder may apply inter-frame or intra-frame compression based on a currently-encoded frame and optionally one or more previously encoded frames. The encoder may be a multi-layer encoder, such as a low complexity enhancement video codec (LCEVC) enabled encoder.
Where the generated frames comprise depth map data, the encoder 12 may perform layered encoding on each instance of image data (e.g. each frame) to generate an encoded frame comprising a base depth map layer and an enhancement depth map layer. Encoding a depth map in this way may improve compression. In some applications, such as HDR video, depth maps are desirably highly detailed with a bit depth of up to twelve or fourteen bits, which is a significant increase in the data to be transmitted. As a result, providing ways to improve compression of the depth map can make more realistic depth map-based displays viable when performing rendering or transmission of rendered data in real-time. Furthermore, this type of layered encoding makes it easy to drop (and then pick back up) one or more of the layers, which provides flexibility and tools for bandwidth management.
Layered encoding is also helpful as the final decoder/user device (such as a user display device) can choose whether to process these extra layers. For example, in a non-layered approach, the best the end device (i.e. the receiver, decoder or display device associated with a user that will view the images) can do is determine that it does not have enough resources for a given quality (be it resolution, frame rate, inclusion of depth map) and then signal to the controller/renderer/encoder that it does not have enough resources. The controller then will send future images at a lower quality. In that alternative scenario, the end device still unfortunately has to process the higher quality data until the lower quality data arrives, if it can process the received images at all.
In some of the described embodiments, this situation is improved upon because when/if the end device determines for example that it does not have the processing capabilities to handle the highest level of quality, then it can drop and/or choose not to process certain layers. The end device may also signal to the controller that it needs a lower level of quality, but in the meantime the end device can only process the number of layers that it can handle. Therefore, the end device can react to conditions much more quickly.
In some cases, depth map data may be embedded in image data. In this case, the base depth map layer may be a base image layer with embedded depth map data, and the enhancement depth map layer may be an enhancement image layer with embedded depth map data.
Alternatively, when the generated images comprise a depth map layer separate from an image layer and multi-layer encoding is applied, the encoded depth map layers may be separate from the encoded image layers. This has the advantage that the encoded depth map layers can be dropped under some conditions while still retaining image layers that can be displayed (albeit with a lower level of realism). For example, the encoded depth map layers can be dropped by a transmitter or encoder when available communication
resources are reduced, or can be dropped by an end device which lacks the processing resources to handle the highest level of quality.
Similarly, if some images comprise an audio base layer, a haptic feedback base layer, an audio enhancement layer or a haptic feedback enhancement layer, these can be processed or dropped flexibly.
Again similarly, if some images comprise an interactivity data base layer or an interactivity enhancement layer these can be processed or dropped flexibly. For example, certain interactions may only be possible where a threshold bandwidth is available, where complex interactions (e.g. those enabling a conversation with a digital object) may be disabled before less complex interactions (e.g. changing a pixel colour) are disabled.
Additionally or alternatively, where the image data comprises point cloud data, the encoder may apply a point cloud data encoding technique such as described in European patent application EP21386059.6, which is incorporated herein by reference. Such a point cloud encoder may act as a base encoder for a layered encoding technique such as LCEVC or VC-6. Notably LCEVC and VC-6 techniques encode and decode a layered signal, but are agnostic about the content type of data encoded in the signal. For example, the signal can include textures, video frames, geometry or depth data, meshes, point clouds, rendering attributes or physics engine attributes.
The transmitter 13 may be any known type of transmitter for wired or wireless communications, including an Ethernet transmitter or a Bluetooth transmitter.
The transmitter 13 may be configured to make decisions about how to transmit the image data, and/or may provide feedback to the encoder 12 or the image generator 11 . For example, the transmitter may determine available communication resources (e.g. bandwidth) for transmitting image data, and may drop one or more layers from an encoded frame, or indicate to the image generator and/or encoder that image data should be generated and encoded with fewer layers, when insufficient bandwidth is available for transmission of all generated data. As specific examples, the transmitter may be configured to drop a depth map layer, an LCEVC enhancement layer, or a VC-6 enhancement layer from a frame when insufficient communication resources are available.
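The bandwidth-driven layer dropping described above might, for example, be sketched as follows; the layer names, sizes, and per-frame budget logic are assumptions rather than part of any codec.

```python
def select_layers(layers, available_bandwidth_bps, frame_rate):
    """Keep the base layer and as many enhancement layers as the available
    bandwidth allows; drop the rest (e.g. a depth map layer or an LCEVC/VC-6
    enhancement layer) when communication resources are insufficient.

    layers -- list of (name, size_in_bits) ordered base layer first.
    """
    budget_per_frame = available_bandwidth_bps / frame_rate
    kept, used = [], 0
    for name, size_bits in layers:
        if not kept or used + size_bits <= budget_per_frame:
            kept.append(name)   # the base layer is always kept
            used += size_bits
        else:
            break               # drop this and all higher enhancement layers
    return kept

layers = [("base", 2_000_000), ("lcevc_enhancement", 1_200_000), ("depth_map", 900_000)]
print(select_layers(layers, available_bandwidth_bps=120_000_000, frame_rate=30))
# -> ['base', 'lcevc_enhancement']  (the depth map layer is dropped)
```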
The network 14 provides a channel for communication between the transmitter 13 and the receiver 15, and may be any known type of network such as a WAN or LAN or a wireless Wi-Fi or Bluetooth network. The network may further be a composite of several networks of different types. Many users only have access to a network with a bandwidth of 30MBps which can lead to latency jitter when streaming. The required bandwidth and the observed latency can be reduced by means of tactics such as forward-looking rendering and last-millisecond reprojection, which are enabled by improved compression.
The receiver 15 may be any known type of receiver for wired or wireless communications, including an Ethernet receiver or a Bluetooth receiver.
The decoder 16 is configured to receive and decode an encoded frame. The decoder may be implemented using executable software or may be implemented on specific hardware such as an ASIC.
The display device 17 may for example be a television screen or a VR headset. The timing of the display may be linked to a configured frame rate, such that the display device may wait before displaying the image. The display device may be configured to perform warping, that is, to obtain a final display window location, adjust a warpable image to obtain a final image corresponding to a final viewing direction of the user, and display the final image.
In this regard, the image data is typically arranged to provide a warpable image for which a portion of the image that is displayed at the display device 17 is dependent on a position or orientation of a viewer. The warpable image may then be rendered before a most up to date viewing direction of the user is known. The
warpable image may be transmitted to the display device, or the warpable image may be transmitted to a rendering node which is near to the display device, and the display device or rendering node may perform time warping to generate a displayed image portion based on the warpable image and the most up to date viewing direction of the user.
As mentioned above, a single device may provide a plurality of the described components. For example, a first rendering node may comprise the image generator 11 , encoder 12 and transmitter 13. Additional similar rendering nodes may be included in the system, and may work together to generate the sequence of frames.
In one case, multiple rendering nodes may each provide separate image data to an image data assembling node; for example, each rendering node may provide a part of a sequence of frames to a frame assembling node.
For example, the receiver 15, decoder 16 or display device 17 may be configured to assemble parts of image data from multiple sources to generate a sequence of images for display on the display device.
Alternatively, the image data assembling node may be separate from the receiver 15, decoder 16 and display device 17.
Additionally or alternatively, multiple rendering nodes may be chained. In other words, successive rendering nodes may add to a sequence of image data as it passes from rendering node to rendering node, and eventually a complete sequence of image data is then provided to the receiver 15. Furthermore, each rendering node may obtain components of a render from multiple upstream rendering nodes and/or distribute components of a render to multiple downstream rendering nodes.
A chain of rendering nodes may be useful for performing different rendering tasks that require different quantities of processing resources, or different frame rates. For example, a company may provide distributed processing in the form of a centralised hub which has abundant processing resources but is distant from users, and peripheral locations which have more scarce processing resources but are closer to users. Expensive but fairly static rendering features such as background lighting or environmental impact on sound may be generated at the central hub (for example using ray tracing), while features that require fewer resources but faster responses or higher frame rates may be generated closer to the user. In other words, the more responsive a rendering feature needs to be, the lower latency it needs between the rendering node which generates the feature and the user display and, in a chain of rendering nodes, the node which generates each rendering feature can be chosen based on a required maximum latency of that feature. On the other hand, if it is expensive to generate a rendering feature, then it may be preferable to generate the feature less frequently and with a higher maximum latency. For example, a static, high-quality background feature may be generated early in the chain of rendering nodes and a dynamic, but potentially lower-quality, foreground feature may be generated later in the chain of rendering nodes, closer to the user device. Here, environmental impact on sound means, for example, that a set of surfaces may be constructed where each surface has different sound reflection and absorption properties depending upon material and shape. The frame rates may be matched by creating multiple frames with features generated at the lower frame rate, and combining them with the frames with features generated at the higher frame rate. In a non-limiting embodiment, a preliminary rendering generates volumetric object data including motion vectors at a first (lowest) frame rate, then produces 2D rendered frames plus depth information for a specific user at a second (higher) frame rate, then transmits video plus depth data to the user device, which produces final frames for display via space warping (depth-based reprojections) at a third (highest) frame rate. One or more of these steps may be performed in combination with the other described embodiments. The viewing position of the user may change as additional rendering tasks are performed at different rendering nodes in the chain. Each or any rendering node may obtain an updated viewing position before performing its respective rendering task.
Additionally, the system may simultaneously generate multiple sequences of image data for different respective users or different respective display devices. For example, in the context of a VR or AR experience, each user or display device may view a different 3D environment, or may view different parts of a same 3D environment. When using a chain of rendering nodes, each node may serve multiple users or just one user.
For example, a starting rendering node (e.g. at a centralised hub) may serve a large group of users. For example, the group of users may be viewing nearby parts of a same 3D environment. In this case, the starting node may render a wide zone of view (“field of view”) which is relevant for all users in the large group.
The starting node may send this wide field of view to a first middle rendering node which renders additional aspects of the 3D environment. These additional aspects may for example be aspects which require less processing power to render, or may be aspects which are specific to individual users of the group. Additionally, the middle rendering node may render features in a smaller field of view than the starting node - this smaller field of view may be relevant to each user rather than the group of users. The first middle rendering node may additionally only serve a smaller number of users (e.g. half of the large group of users), with the remaining users being served by a second middle rendering node which also receives the wide field of view from the starting node.
The middle rendering node(s) may then send sequences of second partially or fully rendered frames to an end device for each user. The end device may perform further processes such as warping or focal distance adjustments, optionally using depth map data.
Preferably, each rendering node encodes the partially or fully rendered frames before transmitting them on to a next rendering node or to the receiver 15. This means that the required communication resources can be reduced when the rendering nodes are separated by one or more networks, or more generally are implemented in a distributed system such as a cloud.
However, each rendering node in a chain is encoding a different partially or fully rendered frame, with different data. Therefore, it may be advantageous for different rendering nodes to use different rendering formats and/or encoding formats. For example, the output from a first rendering node may be point cloud data which logically describes a 3D scene. This point cloud data can be encoded using the techniques of EP21386059.6. A second rendering node may then operate on the point cloud data to generate image data that is more readily displayed by a generic display device, without requiring the display device to model the 3D environment. This image data may be encoded using video coding techniques.
The chaining of rendering nodes may be extended to arbitrary tree structures, where a rendering node obtains partially rendered frames from more than one preceding rendering node, and generates further partially or fully rendered frames based on the multiple obtained sequences of partially rendered frames.
For example, a content rendering network (CRN) comprising numerous rendering nodes may be used to serve a volumetric event to a large number of same-time users, such as users participating in a shared virtual environment. Rendering the same event for each user is far more expensive in terms of computation time and power consumption than rendering the volumetric effect once and performing the rendering equivalent of multicasting the volumetric effect for multiple users. For example, each user may have a second rendering node (such as a VR headset), and the network may comprise a central first rendering node. The first rendering node may render the volumetric event, and distribute partially rendered frames depicting the volumetric event to the different second rendering nodes. The second rendering node for each user may then integrate the partially rendered frames depicting the volumetric event into a view of the virtual environment which is currently being shown to each user, based on parameters such as the user’s virtual position.
The receiver 15, decoder 16 and display device 17 may be consolidated into a single device, or may be separated into two or more devices. For example, some VR headset systems comprise a base unit and a headset unit which communicate with each other. The receiver 15 and decoder 16 may be incorporated into such a base unit.
In some embodiments, the network 14 may be omitted. For example, a home display system may comprise a base unit configured as an image source, and a portable display unit comprising the display device 17.
In the event that the decoder 16 or the display device 17 does not or cannot handle one or more layers, the receiver 15 or another transmitter associated with the decoder or display device may send a corresponding layer drop indication back through the network 14. The layer drop indication may be received by each rendering node. A rendering node which generates partially or fully rendered frames for that specific decoder or display device may cease generating the dropped layer. On the other hand, a rendering node which generates partially or fully rendered frames for multiple end devices may disregard a layer drop indication received from one end device (as the dropped layer is still needed for other devices). Alternatively, rendering nodes which serve multiple end devices may record received layer drop indications, and may cease generating the dropped layer only when all end devices served by the rendering node indicate that the layer is to be dropped.
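A hedged sketch of the layer drop bookkeeping described above is given below: the rendering node only ceases generating a layer once every end device it serves has indicated the drop. The class and method names are assumptions.

```python
class LayerDropTracker:
    """Record layer drop indications per end device; a layer is only dropped by
    the rendering node once every served device has indicated the drop."""

    def __init__(self, served_devices):
        self.served_devices = set(served_devices)
        self.drops = {}  # layer name -> set of devices that have dropped it

    def record_drop(self, device_id, layer_name):
        self.drops.setdefault(layer_name, set()).add(device_id)

    def should_generate(self, layer_name):
        dropped_by = self.drops.get(layer_name, set())
        # Keep generating the layer while at least one served device still needs it.
        return dropped_by < self.served_devices

tracker = LayerDropTracker(served_devices={"headset-1", "headset-2"})
tracker.record_drop("headset-1", "depth_map")
assert tracker.should_generate("depth_map")       # headset-2 still needs the layer
tracker.record_drop("headset-2", "depth_map")
assert not tracker.should_generate("depth_map")   # all served devices have dropped it
```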
In preferred examples, the encoders or decoders are part of a tier-based hierarchical coding scheme or format. Hierarchical coding enables frames to be communicated with higher resolution and/or higher frame rate than is possible in single-tier coding schemes. In hierarchical coding, one or more enhancement layers are communicated with base data, where the enhancement layers can be used to up-sample the base data at the decoder, for example providing up-sampling in a spatial or temporal dimension. When combined with equivalent down-sampling of the original frames and generation of the enhancement layer at an encoder, hierarchical coding can overall provide lossless compression of data, with higher resolution and/or higher frame rate for a given transmission bit rate. Examples of a tier-based hierarchical coding scheme include LCEVC: MPEG-5 Part 2 LCEVC (“Low Complexity Enhancement Video Coding”) and VC-6: SMPTE VC-6 ST-2117, the former being described in PCT/GB2020/050695, published as WO 2020/188273, (and the associated standard document) and the latter being described in PCT/GB2018/053552, published as WO 2019/111010, (and the associated standard document), all of which are incorporated by reference herein. However, the concepts illustrated herein need not be limited to these specific hierarchical coding schemes.
A further example is described in WO2018/046940, which is incorporated by reference herein. In this example, a set of residuals are encoded relative to the residuals stored in a temporal buffer.
LCEVC (Low-Complexity Enhancement Video Coding) is a standardised coding method set out in standard specification documents including the Text of ISO/IEC 23094-2 Ed 1 Low Complexity Enhancement Video Coding published in November 2021 , which is incorporated by reference herein.
The system described above is suitable for generating and presenting a representation of a scene, where this scene displays media content to a user. The scene typically comprises an environment, where the user is able to move (e.g. to move their head or to turn their head) to look around the environment and/or to move around the environment. For example, the scene may be a scene of a room in a building, where the user is able to move around the room (e.g. by moving in the real world and/or by providing an input to a user interface) in order to inspect various parts of the room. Typically, the scene is an XR (e.g. a VR) scene, where the user is able to move about the scene in three degrees of freedom (3DoF) or six degrees of freedom (6DoF) so as to experience the scene.
As has been described with reference to Figure 1 , the image generator 11 may be arranged to determine point cloud data, where each point of the point cloud has a 3D position and one or more attributes. More generally, the image generator (or another component) is arranged to determine a three-dimensional
representation of a scene, where this three-dimensional representation is thereafter used to generate two-dimensional images that are presented to a user at the display device 17.
While the points are typically points of a point cloud, more generally the disclosure extends to any point that is associated with a location and a value. Therefore, the points may, more generally, be considered to be data (or datapoints), which data is associated with a location and a value, and the ‘points’ may comprise polygons, planes (regular or irregular), Gaussian splats, etc.
Referring to Figure 3, there is described a method of determining (an attribute for) a point of such a three-dimensional representation. The method comprises determining the attribute using a capture device, such as a camera or a scanner. The scene may comprise a real scene, in which attribute values are captured using a camera, or a virtual scene (e.g. a three-dimensional model of a scene), in which attribute values are captured using a virtual scanner.
Where this disclosure describes ‘determining a point’ it will be understood that this generally refers to determining a point that has a location and an attribute value, where determining the point comprises determining the attribute value and/or storing a point that comprises at least an attribute value and a location value (these values may be indirect values, e.g. where the location is identified relative to another point). Once a plurality of points have been captured, these points can be stored as a three-dimensional representation (e.g. a point cloud) so as to enable the reconstruction of the three-dimensional scene based on this representation.
Typically, the scene comprises a simulated scene that exists only on a computer. Such a scene may, for example, be generated using software such as the Maya software produced by Autodesk®. The attributes determined using the methods described herein may then depend on virtual objects located within the scene as well as a virtual lighting arrangement used in the scene.
In a first step 31 , a computer device initiates a capture process for a capture device, the capture process being initiated with an initial azimuth angle (e.g. of 0°) and an initial elevation angle (e.g. of 0°).
In a second step 32, the computer device causes a point to be captured using the capture device at the current azimuth angle and current elevation angle. Capturing a point typically comprises assigning an attribute value to the point, which attribute value may, for example, be a colour of the point and/or a transparency value of the point. Typically, the point has one or more colour values associated with each of a left eye and a right eye of a viewer. Capturing the point may also comprise determining a normal value associated with the point, e.g. a normal of a surface on which the point lies. Typically, capturing the point further comprises determining a location of the point, e.g. by determining a distance of the point from the camera.
In practice, determining the point may comprise sending a ‘ray’ from the capture device and then stepping through a computer model to determine which surface of the computer model is impacted by the ray. The colour, transparency, and normal of this surface are then recorded alongside the distance of the surface from the capture device.
In a third step 33, the computer device determines whether a point has been captured for the capture device at each azimuth of a range of azimuths and, in a fourth step 34, if points have not been captured at each azimuth, then the azimuth angle is incremented and the method returns to the second step 32 and another point is captured. The azimuth angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of azimuth angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
Once a point has been captured for each azimuth, in a fifth step 35, the computer device determines whether a point has been captured for the capture device at each elevation of a range of elevations and in
a sixth step 36, if points have not been captured at each elevation, then the azimuth angle is reset to the initial value, the elevation angle is incremented, and the method returns to the second step 32 and another point is captured. The elevation angle may, for example, be incremented by between 0.01° and 1° and/or by between 0.025° and 0.1°. Typically, the range of elevation angles is selected to be 360° (i.e. so that the capture device captures points surrounding the entirety of the capture device), but it will be appreciated that other ranges are possible.
In a seventh step 37, once points have been captured for each azimuth angle and each elevation angle, the scanning process ends.
This method enables a capture device to capture points at a range of elevation and azimuth angles. This point data is typically stored in a matrix. The point data may then be used to provide a representation of the scene to a user, e.g. the three-dimensional representation formed by the point data may be processed to produce two-dimensional images for each eye of a user, with these images then being shown to a user via the display device 17 to provide a virtual reality experience to the viewer. By using the captured data, a video can be provided to a viewer that enables the viewer to move their head to look around the scene (while remaining at the location of the capture device).
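The scanning loop of Figure 3 can be summarised by the following sketch; the cast_ray helper, the one-degree default increments (chosen only to keep the example small, finer increments such as 0.025° to 0.1° being typical), and the dictionary layout of each point are assumptions.

```python
def capture_points(cast_ray, azimuth_step=1.0, elevation_step=1.0):
    """Sweep a capture device through the full 360-degree ranges of azimuth and
    elevation angles (Figure 3), storing one captured point per angle pair.

    cast_ray(azimuth, elevation) is assumed to return (distance, attributes) for
    the first surface hit by a ray sent from the capture device."""
    n_azimuth = int(360 / azimuth_step)
    n_elevation = int(360 / elevation_step)
    # Point data is stored in a matrix indexed by (elevation step, azimuth step).
    matrix = [[None] * n_azimuth for _ in range(n_elevation)]
    for e in range(n_elevation):
        for a in range(n_azimuth):          # the azimuth sweep restarts for each new elevation
            azimuth, elevation = a * azimuth_step, e * elevation_step
            distance, attributes = cast_ray(azimuth, elevation)
            matrix[e][a] = {"azimuth": azimuth, "elevation": elevation,
                            "distance": distance, "attributes": attributes}
    return matrix

# Toy usage with a dummy scene in which every ray hits a grey surface 2 m away.
matrix = capture_points(lambda az, el: (2.0, {"colour": (128, 128, 128)}))
```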
It will be appreciated that the capture pattern (or scanning pattern) described with reference to Figure 3 is purely exemplary and that numerous capture patterns are possible. In general, the capture process for each capture device comprises capturing one or more points at one or more azimuth angles and/or one or more elevation angles.
The ‘points’ captured by the capture device are typically associated with a size, such as a height, a width, or a depth. That is, the points typically relate to two-dimensional planes/pixels and/or three-dimensional voxels. In this regard, there is necessarily some space between the locations of adjacent points (since if the points had no width, then an infinite number of points would be required to capture points at each angle). The size provides points that depict a non-negligible area of the three-dimensional space so that a plurality of points can be fit together to provide a depiction of the scene to a viewer.
The width and height of each point is typically dependent on the distance of that point from the capture device, where more distant points have a larger width/height. The width and height of each point is typically determined so that when each point is displayed, there is no space between adjacent points (indeed, there may be some overlap between points to ensure that no gaps appear between points). This height/width of each point can be determined at the time of capturing the points, or can be determined or defined after the capture of the points.
Typically, the points comprise a size value, which is stored as a part of the point data. For example, the points may be stored with a width value and/or a height value. Typically, the minimum width and the minimum height of a point are set by the angle increment of the azimuth angle and the elevation angle respectively. The size may then be specified in terms of this angle increment and/or in terms of this minimum width/minimum height (e.g. as being a multiple of the angle increment). In some embodiments, the size value is stored as an index, which index relates to a known list of sizes (e.g. if the size may be any of 1x1, 2x1, 1x2, or 2x2 pixels, this may be specified by using 3 bits and a list that relates each combination of bits to a size). The size may be stored based on an underscan value. In this regard, where an object is very near to the viewing zone it may be captured using an unnecessarily dense arrangement of points. Therefore, certain surfaces or areas of the representation may be associated with an underscan value, which underscan value defines a reduction in the number of points captured as compared to a representation without underscan. The size of the points may be defined so as to indicate this underscan value. In an exemplary embodiment, the underscan value is an integer value between 0 and 3 and the size is stored as a combination of point dimensions (e.g. a width in the range [0,2] and a height in the range [0,2]) and an underscan factor (e.g. an underscan factor in the range [0,3]).
In some embodiments, the width and the height are dependent on the underscan factor. For example, when the underscan factor exceeds a threshold value, the possible height and width values may be limited. In a specific example, when the underscan factor is 3, the width and the height may be limited to the range [0,1]. The size may then be defined as size = underscan*9 + height*3 + width. Such a method provides efficient storage and indication of width, height, and underscan values.
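The packing of the exemplary embodiment above (width and height each in [0,2], underscan factor in [0,3], size = underscan*9 + height*3 + width) corresponds to the following sketch; note that the further restriction applied when the underscan factor is 3 is not enforced here.

```python
def pack_size(width, height, underscan):
    """Pack width/height (each 0..2) and underscan (0..3) into a single size index."""
    assert 0 <= width <= 2 and 0 <= height <= 2 and 0 <= underscan <= 3
    # The document notes that when the underscan factor is 3 the width and height
    # may be limited to [0,1]; that additional restriction is not enforced here.
    return underscan * 9 + height * 3 + width

def unpack_size(size):
    """Recover (width, height, underscan) from a packed size index."""
    underscan, rest = divmod(size, 9)
    height, width = divmod(rest, 3)
    return width, height, underscan

assert unpack_size(pack_size(2, 1, 3)) == (2, 1, 3)
```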
As shown in Figure 4a, typically, for each capture step (e.g. each azimuth angle and/or each elevation angle), a plurality of sub-points SP1, SP2, SP3, SP4, SP5 is determined. For example, where the azimuth angle increment is 0.1° then for an azimuth angle of 0°, sub-points may be determined at azimuth angles of -0.05°, -0.025°, 0°, 0.025°, and 0.05° (and similar sub-points may be determined for a plurality of elevation angles). Attribute values of these sub-points may then be combined to obtain an attribute value for the point. For example, a maximum attribute value of the sub-points may be used as the value for the point, an average attribute value of the sub-points may be used as the value for the point, and/or a weighted average of the sub-points may be used as the value for the point. It will be appreciated that numerous other methods for combining the attribute values of the sub-points are possible.
By determining the attribute of a point based on the attributes of sub-points, the accuracy of the capture process can be increased. While it would be possible to simply reduce the increment of the angle steps to provide a higher resolution scene, by considering sub-points but only storing attributes for points, a balance can be struck between accuracy and file size (since storing every sub-point would lead to a substantial increase in the amount of data that needs storing).
With the example of Figure 4a, for each point of the three-dimensional representation that is captured by a capture device, this capture device may obtain attributes associated with each of the sub-points SP1, SP2, SP3, SP4, SP5, combine these attributes to obtain a point attribute, and then store a point with a distance that is an average (e.g. a weighted average) of the distances of the sub-points from the capture device, at the nominal angle of the point, with the point attribute.
As shown in Figure 4b, where a plurality of sub-points SP1, SP2, SP3, SP4, SP5 are considered, these points may have different distances from the location of the capture device. In some embodiments, the attributes of the sub-points may be combined in dependence on this distance, e.g. so that sub-points nearer to the capture device have higher weightings.
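A distance-weighted combination of sub-point attributes, as described above, might be sketched as follows; the inverse-distance weighting is one possible choice rather than a prescribed one, and the scalar attribute is an assumption for simplicity.

```python
def combine_subpoints(subpoints):
    """Combine sub-point values into a single point value, giving sub-points
    nearer the capture device a higher weighting.

    subpoints -- list of (distance, attribute_value) pairs, where attribute_value
                 is a scalar (e.g. one colour channel)."""
    # Inverse-distance weighting: nearer sub-points count for more.
    weights = [1.0 / max(d, 1e-6) for d, _ in subpoints]
    total = sum(weights)
    attribute = sum(w * v for w, (_, v) in zip(weights, subpoints)) / total
    distance = sum(w * d for w, (d, _) in zip(weights, subpoints)) / total
    return distance, attribute

# Two nearby sub-points dominate the more distant one.
print(combine_subpoints([(1.0, 200), (1.1, 210), (4.0, 20)]))
```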
However, the possibility of sub-points with substantially different distances raises a potential problem. Typically, in order to determine a distance for a point, the distances for the sub-points are averaged. But where the sub-points have substantially different distances and/or are related to different surfaces in the scene, this may result in the point having a distance that does not correspond to any actual surface in the scene. Therefore, the point may seem to hang in space (e.g. to hang between the front and rear surfaces shown in Figure 4b).
Similarly, where the attribute values of the sub-points greatly differ, e.g. if the sub-points SP1 and SP2 are white in colour and the sub-points SP3 and SP4 are black in colour, then the attribute value of the point may be substantially different to the attribute value of other points in the scene. In an example, if the scene were composed of black and white objects, the point may appear as a grey point hanging in space between these objects.
In some embodiments, the computer device is arranged to aggregate sub-points so as not to create any floating points. For example, the computer device may determine whether the sub-points are spatially coherent by employing a clustering algorithm (e.g. a k-means clustering algorithm). Where the sub-points are spatially coherent (e.g. where a difference in the distance of the sub-points is below a threshold value), these distances may be averaged to obtain a distance for the point. Where the sub-points are not spatially coherent, the sub-points may be processed to ensure that the distance of any point places it upon a surface; for example, in the system of Figure 4b, sub-points SP1 , SP2, and SP3 may be grouped into a first point
and sub-points SP4 and SP5 may be grouped into a second point. Since each sub-point is associated with the same capture device and capture angle (all of these sub-points being associated with a capture step that has a particular azimuth angle and elevation angle), these points may be located at the same angle with respect to the capture device. Therefore, to ensure that each sub-point affects the representation considered, the first point (made up of sub-points SP1, SP2, and SP3) may have a smaller distance value than the second point (made up of sub-points SP4 and SP5) and the first point may be assigned a nonzero transparency value so that the second point can be seen through the first point.
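A minimal sketch of this grouping is given below, assuming that spatial coherence is judged by a simple gap in the sorted sub-point distances rather than by a full k-means clustering; the threshold value and the transparency value assigned to nearer points are illustrative assumptions.

    def group_subpoints_by_distance(distances, threshold=0.5):
        # Split the sub-point distances into spatially coherent groups:
        # a new group is started whenever the gap to the previous
        # distance exceeds the threshold.
        ordered = sorted(distances)
        groups, current = [], [ordered[0]]
        for d in ordered[1:]:
            if d - current[-1] <= threshold:
                current.append(d)
            else:
                groups.append(current)
                current = [d]
        groups.append(current)
        return groups

    def points_from_groups(groups):
        # One point per coherent group; every point except the farthest
        # is given a nonzero transparency so that points behind it
        # remain visible.
        points = []
        ordered_groups = sorted(groups, key=min)
        for i, group in enumerate(ordered_groups):
            distance = sum(group) / len(group)
            transparency = 0.5 if i < len(ordered_groups) - 1 else 0.0
            points.append({"distance": distance, "transparency": transparency})
        return points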
By capturing points at a plurality of azimuth angles and elevation angles, e.g. using the method described with reference to Figure 3, it is possible to provide a three-dimensional representation of the scene that can later be used to enable a viewer to view the scene from a plurality of angles. More specifically, given the three-dimensional points captured by the capture device, a computer device is able to render a two-dimensional representation (e.g. a two-dimensional image) of the scene for each eye of a viewer so as to provide a representation with an impression of depth. The computer device may render a series of two-dimensional representations to enable the viewer to look around the scene, where the two-dimensional representations are rendered based on an orientation of the viewer’s head. In this way, the determined representation is useable to provide, for example, a virtual reality (VR), mixed reality (MR), augmented reality (AR), and/or extended reality (XR) experience to the viewer.
To enable such a display, the display device 17 is typically a virtual reality headset that comprises a plurality of sensors to track a head movement of the user. By tracking this head movement, the display device is able to update the images being displayed to the viewer as the viewer moves their head to look about the scene. Typically, this involves the display device sending the sensor data to an external computer device (e.g. a computer connected to the display device via a wire). The external computer device may comprise powerful graphics processing units (GPUs) and/or central processing units (CPUs) so that the external computer device is able to rapidly render appropriate two-dimensional images for the viewer based on the three-dimensional representation and the sensor data.
In some embodiments, the external computer device may comprise a server device, where the display device 17 may be connected to this server device wirelessly. This enables the two-dimensional images to be streamed from the server to the display device so as to enable the display of high-quality images without the need for a viewer to purchase expensive computer equipment. In other words, operations that require large amounts of computing power, such as the rendering of two-dimensional images based on the three- dimensional representation, may be performed by the server, so that the display device is only required to perform relatively simple operations. This enables the experience to be provided to a wide range of viewers.
In some embodiments, a first two-dimensional image is provided to the display device 17 (and/or a connected device) and this first image is ‘warped’ in order to provide an image for viewing at the display device. The warping of the image comprises processing the image based on the sensor data in order to provide an image that matches a current viewpoint of the viewer. By performing the warping at the display device or another local device, the lag between a head movement of the user and an updating of the two-dimensional representation of the scene can be reduced.
One issue with the above-described method of capturing a three-dimensional representation is that it only enables a viewer to make rotational movements. That is, since the points are captured using a single capture device at a single capture location, there is no possibility of enabling translational movements of a viewer through a scene. This inability to move translationally can induce motion sickness within a viewer, can reduce a degree of immersion of the viewer, and can reduce the viewer’s enjoyment of the scene.
Therefore, it is desirable to enable translational movements through the scene. To enable such movements, the three-dimensional representation of the scene may be captured using a plurality of capture devices
placed at different locations (or the same capture device placed at different locations). A viewer is then able to move around the scene translationally (e.g. by moving between these locations).
More generally, by capturing points for every possible surface that might be viewed by a viewer, a three-dimensional representation of a scene may be captured that allows a suitable two-dimensional representation of this scene to be rendered regardless of a location of a viewer (e.g. regardless of where a user is standing within a virtual room).
This need to capture points for every possible surface (so as to enable movement about a scene) greatly increases the amount of data that needs to be stored to form the three-dimensional representation.
Therefore, as has been described in the application WO 2016/061640 A1, which is hereby incorporated by reference, the three-dimensional representation may be associated with a viewing zone, or a zone of viewpoints (ZVP), where the three-dimensional representation is arranged to enable a user to move about the viewing zone so as to view the scene.
Figure 5 illustrates such a viewing zone 1 and illustrates how the use of a viewing zone limits the amount of image data that needs to be stored to provide a three-dimensional representation of the scene. With the scene shown in this figure, and the viewing zone 1 shown in this figure, it is not necessary to determine attribute data for the occluded surface 2 since this occluded surface cannot be viewed from any point in the viewing zone. Therefore, by enabling the user to only move within the viewing zone (as opposed to around the whole scene) the amount of data needed to depict the scene is greatly reduced.
While Figure 5 shows a two-dimensional viewing zone, it will be appreciated that in practice the viewing zone 1 is typically a three-dimensional zone or volume.
The viewing zone 1 may, for example, comprise a rectangular volume, or a rectangular parallelepiped, and the viewing zone may have a height of at least 30 cm, a depth of at least 30 cm, and/or a width of at least 30 cm, where these dimensions enable a user to move their head while remaining in the viewing zone. This is merely an exemplary arrangement of the viewing zone; it will be appreciated that viewing zones of various shapes and sizes may be used (e.g. spherical viewing zones). That being said, it is preferable that the viewing zone is limited so as to cover only a part of the volume of the scene, e.g. no more than 50% of the scene, no more than 25% of the scene, and/or no more than 10% of the scene. In this regard, if the viewing zone is the same size as the scene, then the three-dimensional representation will simply be a standard representation for virtual reality (that enables a user to move freely about the scene) - and so the use of the viewing zone will not provide any reduction in file size.
The viewing zone 1 enables movement of a viewer around (a portion of) the scene. For example, where the scene is a room, the base representation may enable a user to walk around the room so as to view the room from different angles. In particular, the viewing zone enables a user to move through the scene with six degrees-of-freedom (6DoF) movement, where this aids in the provision of an immersive experience.
In some embodiments, the viewing zone 1 may be four-dimensional, where a three-dimensional location of the viewing zone changes over time - and in such embodiments the size and location of the occluded surface 2 may also change over time. More generally, it will be appreciated that viewing zones may be formed in any size or shape, with different sizes and shapes being suitable for different scenes.
The volume of the viewing zone 1 is typically selected so that a user is able to move to a degree sufficient to avoid motion sickness and to provide an immersive sensation, while still only enabling a limited amount of movement (where this leads to a smaller file size as compared to an implementation where a user is able to fully move about the scene). Typically, the viewing zone is arranged to enable a user to move their head while they are sitting or standing, but not to freely roam around a room.
The viewing zone 1 may have a (e.g. real-world) volume of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³), and/or less than one-hundredth of a cubic metre (0.01 m³).
The viewing zone 1 may also have a minimum size, e.g. the viewing zone may have a volume of at least 1% of the volume of the scene, at least 5% of the volume of the scene, and/or at least 10% of the volume of the scene. Similarly, the viewing zone may have a volume of at least one-thousandth of a cubic metre (0.001 m³); at least one-hundredth of a cubic metre (0.01 m³); and/or at least one cubic metre (1 m³).
The ‘size’ of the viewing zone 1 typically relates to a size in the real world, where if the viewing zone has a length of one metre this means that a user is able to move one metre in the real world while staying within the viewing zone. The size of the viewing zone in the scene may be greater than, equal to, or less than the size of the viewing zone in the real world. For example, the viewing zone may scale a real-world distance so that moving one metre in the real world moves the user less than (or more than) one metre in the scene. This enables the scene to provide different perceptions to the user (e.g. to make the user feel larger or smaller than they are in real life). Similarly, the viewing zone may scale a real-world angle so that rotating one degree in the real world rotates the user less than (or more than) one degree in the scene.
Therefore, a viewing zone with a volume of one cubic metre typically connotes a viewing zone in which the user is able to move about a one cubic metre volume in the real world while remaining in the viewing zone. This may cause the user to move about a volume that is more than, or less than, one cubic metre in the scene.
Referring to Figure 6a, in order to capture points for each surface and location that is visible from the viewing zone 1, a plurality of capture devices C1, C2, ..., C9 may be used (e.g. a plurality of virtual scanners and/or a plurality of cameras). Each capture device is typically arranged to perform a capture process, e.g. as described with reference to Figure 3, in which the capture device captures points at a plurality of azimuth angles and elevation angles. By locating the capture devices appropriately, e.g. by locating a capture device at each corner of the viewing zone, it can be ensured that most (or all) points of a scene are captured.
Typically, a first capture device C1 is located at a centrepoint of the viewing zone 1. In various embodiments, one or more capture devices C2, C3, C4, C5 may be located at the centres of faces of the viewing zone; and/or one or more capture devices C6, C7, C8, C9 may be located at edges of and/or corners of the viewing zone.
Figure 6a shows a two-dimensional view (e.g. a plan view) of a rectangular viewing zone. It will be appreciated that within this viewing zone each capture device may be located on a shared plane. Equally, the various capture devices may be located on different planes. Referring, for example, to Figure 6b, there is shown a three-dimensional view of a cuboid viewing zone, where there is a capture device located: at the centre of the viewing zone; at the centre of each face of the viewing zone; and at each corner of the viewing zone.
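By way of a hedged example, the following sketch generates the fifteen capture device locations shown in Figure 6b (the centre, the six face centres, and the eight corners) for a cuboid viewing zone centred on the origin; the axis convention and the dimension parameters are assumptions of the sketch.

    from itertools import product

    def capture_device_locations(width, depth, height):
        # Candidate capture device locations for a cuboid viewing zone
        # centred on the origin: the centre, the face centres, and the corners.
        hx, hy, hz = width / 2, depth / 2, height / 2
        centre = [(0.0, 0.0, 0.0)]
        faces = [(hx, 0, 0), (-hx, 0, 0), (0, hy, 0),
                 (0, -hy, 0), (0, 0, hz), (0, 0, -hz)]
        corners = [(sx * hx, sy * hy, sz * hz)
                   for sx, sy, sz in product((-1, 1), repeat=3)]
        return centre + faces + corners  # 1 + 6 + 8 = 15 locations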
With this arrangement, many locations in the scene (e.g. specific surfaces) will be captured by a plurality of capture devices so that there will be overlapping points relating to different capture devices. This is shown in Figure 7, which shows a first point P1 being captured by each of a first capture device C1, a sixth capture device C6, and a seventh capture device C7. Each capture device captures this point at a different angle and distance and may be considered to capture a different ‘version’ of the point.
Typically, only a single version of the point is stored, where this version may be the highest quality version of the point and/or may be the version of the point associated with the nearest and/or least angled capture device.
The highest ‘quality’ version of the point is typically captured by the capture device with the smallest distance and smallest angle to the point (e.g. the smallest solid angle). In this regard, as described with
reference to Figures 4a and 4b, capturing a point for a given azimuth angle and elevation angle typically comprises capturing a plurality of sub-points at varying sub-point azimuth and elevation angles spread around the point azimuth and elevation angles. Due to the different spreads of sub-points, each capture device will capture a different version of the point (that has a different attribute) even when the points are at the same location. Capture devices that are close to the point and less angled with respect to the point typically have a smaller spread of sub-points and so typically obtain a version of a point that is sharper than a version of that point captured by more distant capture devices.
In some embodiments, a quality value of a version of the point is determined based on the spread of sub-points associated with this version (e.g. based on the perimeter formed by these sub-points and/or based on a surface area or volume bounded by these sub-points). The version of the point that is stored may depend on the respective quality values of the possible versions of the point.
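The following sketch illustrates one possible quality measure of this kind; measuring the spread as the perimeter of the loop formed by the sub-point locations is an assumption of the sketch, and other measures (e.g. a bounded area or volume) could equally be used.

    import math

    def subpoint_spread(subpoints):
        # subpoints: (x, y, z) locations of the sub-points for one
        # version of a point; the spread is the perimeter of the loop
        # SP1 -> SP2 -> ... -> SP1.
        perimeter = 0.0
        for a, b in zip(subpoints, subpoints[1:] + subpoints[:1]):
            perimeter += math.dist(a, b)
        return perimeter

    def quality_value(subpoints):
        # A smaller spread gives a sharper version of the point, so
        # quality is taken here as the inverse of the spread.
        return 1.0 / (subpoint_spread(subpoints) + 1e-9)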
Regarding the ‘versions’ of the points, it will be appreciated that two ‘points’ in approximately the same location captured by each capture device may not have exactly the same location in the three-dimensional representation. More specifically, since each capture device typically projects a ‘ray’ at a given angle, the rays of differing capture devices may contact the surface at different locations for each capture device. Two points may be considered to be two ‘versions’ of a single point when they are within a certain proximity, e.g. a threshold proximity. For example, where the first capture device C1 captures a first point and a second point at subsequent azimuth angles, and the sixth capture device C6 captures a further point that is in between the locations of the first point and the second point, this further point may be considered to be a ‘version’ of one of the first point and the second point.
This difference in the points captured by different capture devices is illustrated by Figures 8a and 8b, which show the separate captured grids that are formed by two different capture devices. As shown by these figures, each capture device will capture a slightly different ‘version’ of a point at a given location and these captured points will have different sizes. Each capture step is associated with a particular range of angles (e.g. a nominal capture angle of 1° might encompass angles from 0.9° to 1.1°), and therefore capture devices that are far from a point to be captured represent a wider region at the capture distance than capture devices closer to that point to be captured. As shown in Figure 8a, the capture device C1 would capture the points P1 and P2 in separate brackets, whereas for the capture device C2 these points are in the same bracket. Therefore, the capture device C2 might determine a single point that encompasses both points P1 and P2, whereas the capture device C1 would determine separate points for these two points.
Considering then a situation in which points P1 and P2 are captured separately, with capture device C1 used to capture point P1 while capture device C2 is used to capture point P2, it should be apparent that the ‘sizes’ of these captured points, and the locations in space that are encompassed by the captured points, will be based on different grids. For example, the width of the captured point P2 captured by the capture device C2 will be larger than the width of the captured point P1 captured by the capture device C1. The capture process may be determined based on the existence of these different grids, and on the different bracket widths that occur at different distances from a capture device.
Figure 8a shows an exaggerated difference between grids for the sake of illustration. Figure 8b shows a more realistic embodiment in which the three-dimensional representation comprises a plurality of points associated with different capture devices, where these points lie on different grids associated with these different capture devices.
In order to store the points of the three-dimensional representation, the points may be stored as a string of bits, where a first portion of the string indicates a location of the point (e.g. using x, y, z coordinates) and a second portion of the string indicates an attribute of the point. In various embodiments, further portions of the string may be used to indicate, for example, a transparency of the point, a size of the point, and/or a shape of the point.
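As a purely illustrative sketch of such a bit string, the location might be packed as three floating-point coordinates followed by single-byte attribute and transparency fields; the field widths and ordering here are assumptions of the sketch rather than a prescribed format.

    import struct

    def pack_point(x, y, z, attribute, transparency=0):
        # First portion of the string: the location (x, y, z);
        # second portion: the attribute; a further byte carries the transparency.
        return struct.pack("<fffBB", x, y, z, attribute, transparency)

    def unpack_point(data):
        x, y, z, attribute, transparency = struct.unpack("<fffBB", data)
        return {"location": (x, y, z), "attribute": attribute,
                "transparency": transparency}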
A computer device that processes the three-dimensional representation after the generation of this representation is then able to determine the location and attribute of each point so as to recreate the scene. This location and attribute may then be used to render a two-dimensional representation of the scene that can be displayed to a viewer wearing the display device 17. Specifically, the locations and attributes of the points of the three-dimensional representation can be used to render a two-dimensional image for each of the left eye of the viewer and the right eye of the viewer so as to provide an immersive extended reality (XR) experience to the viewer.
The present disclosure considers an efficient method of storing the locations of the points (e.g. at an encoder) and of determining the locations of the points (e.g. at a decoder).
As has been described with reference to Figures 6a and 6b, the points of the three-dimensional representation are determined using a set of capture devices placed at locations about the viewing zone, where these capture devices are arranged to capture points at a series of azimuth angles and elevation angles. Typically, each of the capture devices is arranged to use the same capture process (e.g. the same series of azimuth angles and elevation angles), though it will be appreciated that different series of capture angles are possible. For example, there may be a plurality of possible series of capture angles, where different capture devices use different capture angles.
In general, the present disclosure considers a method in which points are stored based on a capture device identifier and an indication of a distance of the point from the capture device associated with this capture device identifier. Typically, the point is also associated with an angular indicator, which indicates an azimuth angle and/or an elevation angle of the point relative to the identified capture device.
It will be appreciated that the storage of the distance and the angle may take many forms. For example, the distance and the angle of each point may be converted into a universal coordinate system, where each capture device has a different location in this universal coordinate system. In particular, each point may be stored with reference to a centre of this universal coordinate system, which centre may be co-located with a central capture device. Where a point is determined based on a distance and an angle from a capture device of a known location in this universal coordinate system, the coordinates of the point in this universal coordinate system can be determined trivially - and the location of the point may then be stored either relative to the capture device or as a coordinate in the universal coordinate system.
The capture device identifier may comprise a location of a capture device (e.g. a location in a co-ordinate system of the three-dimensional representation). Equally, the capture device identifier may comprise an index of a capture device. Similarly, the indication of the azimuth angle and the elevation angle for a point may comprise an angle with reference to a zero-angle of a co-ordinate system of the three-dimensional representation. Equally, the azimuth angle and/or the elevation angle may be indicated using an angle index.
In some embodiments, the three-dimensional representation is associated with configuration information, which configuration information comprises one or more of: a set of capture device indexes; locations associated with the capture devices and/or the capture device indexes; a spacing of capture devices (e.g. so that locations of the capture devices can be determined from a location of a first capture device and the spacing); angles associated with a capture process for the capture devices; an azimuth angle increment and/or an elevation angle increment associated with the capture process; and a set of angle indexes (e.g. to match an angle index to an angle).
With this configuration information, it is possible to determine a location of each capture device from an index of that capture device and/or to determine a capture angle from a known capture process. Therefore, given two numbers: a capture device index and an angle index (that is associated with a combination of a specific azimuth angle and a specific elevation angle), a location of a capture device and a direction of a point from this capture device can be determined. By also signalling a distance of the point from the
signalled capture device, a precise location of the point in the three-dimensional space can be signalled efficiently.
Typically, the point is associated with each of: a capture device index, a distance, a first angular index (e.g. an azimuth index), and a second angular index (e.g. an elevation index).
This method of indicating a location of a point enables point locations to be identified using a much smaller number of bits than if each point location is identified using x, y, z coordinates.
A method of determining a location of a point may be carried out by a computer device, e.g. the image generator 11 and/or the decoder 15, and may comprise the following steps:
In a first step, the computer device identifies an indicator of a capture device used to capture the point. Typically, this comprises identifying a portion of a string of bits associated with a capture device index.
In a second step, the computer device identifies an indicator of an angle of the point from the capture device. Typically, this comprises identifying an angle index, e.g. an azimuth index and/or an elevation index and/or a combined azimuth/elevation index, which index(es) identifies a step of the capture process during which the point was captured.
In a third step, based on the identifiers, the computer device determines the location of the capture device and the angle of the point from the capture device.
The capture device identifier is typically a capture device index, which is related to a capture device location based on configuration information that has been sent before, or along with, the point data. For example, the configuration information may specify:
- Location of first capture device is (0,0,0).
- Step between capture devices is (0,0,1) along the grid, then across the grid, then up the grid.
- The grid is (10,10,10).
With this information, a capture device with an index of 1 can be determined to be located at (0,0,0); a capture device with an index of 5 can be determined to be located at (0,0,4); a capture device with an index of 11 can be determined to be located at (0,1,0); and so on.
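A minimal sketch of this decoding, assuming the 1-based indexing and the 'along, then across, then up' stepping of the configuration example above, is:

    def capture_device_location(index, origin=(0, 0, 0), grid=(10, 10, 10)):
        # Decode a 1-based capture device index into a location,
        # stepping first along the grid (z), then across (y), then up (x).
        nx, ny, nz = grid
        i = index - 1
        z = i % nz
        y = (i // nz) % ny
        x = i // (nz * ny)
        return (origin[0] + x, origin[1] + y, origin[2] + z)

    # capture_device_location(1)  -> (0, 0, 0)
    # capture_device_location(5)  -> (0, 0, 4)
    # capture_device_location(11) -> (0, 1, 0)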
Equally, the configuration information may specify a list of camera indexes and locations associated with these indexes, where this enables the use of a wide range of setups of capture devices.
Typically, the three-dimensional representation is associated with a frame of video. The configuration information may be constant over the frames of the video so that the configuration information needs to be signalled only once for an entire video. Therefore, the configuration information may be transmitted alongside a three-dimensional representation of a first frame of the video, with this same information being used for any subsequent frames (e.g. until updated configuration information is sent).
The angle identifier may similarly be related to an angle by a starting angle and an increment that are signalled in a configuration file. For example, the configuration information may specify:
- An azimuth increment and an elevation increment are each 1°.
- There are 360 angle values (from 0° to 359°) for each angle type.
With this information: a capture angle with an index of 0 can be determined to be at an azimuth angle of 0° and an elevation angle of 0°; a capture angle with an index of 10 can be determined to be at an azimuth angle of 10° and an elevation angle of 0°; a capture angle with an index of 360 can be determined to be at an azimuth angle of 0° and an elevation angle of 1°; and a capture angle with an index of 370 can be determined to be at an azimuth angle of 10° and an elevation angle of 1°; etc.
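A corresponding sketch of this angle decoding, assuming a 0-based index, a 1° increment, and 360 azimuth values per elevation row as in the configuration example above, is:

    def decode_angle_index(index, increment=1.0, azimuths_per_row=360):
        # Decode a 0-based angle index into an (azimuth, elevation) pair.
        azimuth = (index % azimuths_per_row) * increment
        elevation = (index // azimuths_per_row) * increment
        return azimuth, elevation

    # decode_angle_index(0)   -> (0.0, 0.0)
    # decode_angle_index(10)  -> (10.0, 0.0)
    # decode_angle_index(360) -> (0.0, 1.0)
    # decode_angle_index(370) -> (10.0, 1.0)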
In a fourth step, based on the determined location of the capture device and the determined angle, a location of the point is determined. Typically, this comprises determining the location of the point based on the location of the capture device, the capture angle, and a distance of the point from the capture device (where this distance is specified in the point data for the point).
Determining the location of the point typically comprises determining the location of the point relative to a centrepoint of the three-dimensional representation. This location of the point may then be converted into a desired coordinate system and/or the point may be processed based on its location (e.g. to stitch together adjacent points).
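By way of illustration, the fourth step might be sketched as the following conversion from a capture device location, an azimuth angle, an elevation angle, and a distance into a location in the universal coordinate system; the axis convention (z up, azimuth measured about z from the x axis) is an assumption of the sketch.

    import math

    def point_location(device_location, azimuth_deg, elevation_deg, distance):
        # Convert (capture device, azimuth, elevation, distance) into
        # coordinates in the universal coordinate system.
        az = math.radians(azimuth_deg)
        el = math.radians(elevation_deg)
        dx = distance * math.cos(el) * math.cos(az)
        dy = distance * math.cos(el) * math.sin(az)
        dz = distance * math.sin(el)
        cx, cy, cz = device_location
        return (cx + dx, cy + dy, cz + dz)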
The angular identifier typically comprises a first angular identifier and a second angular identifier, where the first identifier provides the azimuthal angle of the point and the second identifier provides the elevation angle of the point.
Referring to Figure 9, each angular identifier may be provided as an index of a segment of the three-dimensional representation, where, for example, an index of 0 may identify the point as being in a first angular bracket 101 and an index of 1 may identify the point as being in a second angular bracket 102.
In this regard, the capture devices are arranged to perform a capture process, e.g. as described with reference to Figure 3, with a non-infinite angular resolution. Given this non-infinite resolution, each point is not a one-dimensional point located at a precise angle. Instead, each point is a point for a particular area of space, with the size of this area being dependent on the angular resolution as well as the distance of the point from the capture device. In other words, each capture angle determines a point for an angular range (with the range being dependent on the angular resolution). That is, if the capture process leads to points being captured at angles of 10°, 11°, and 12° then this can equally be considered to relate to points being captured at a first range of 9.5°-10.5°, a second range of 10.5°-11.5°, and a third range of 11.5°-12.5°.
This is shown in Figure 9, which shows a series of angular brackets, with the size of these angular brackets at a given distance being dependent on the angular resolution. The angular identifier(s) typically comprise a reference to such an angular bracket. Consider, for example, a cube placed with the capture device C1 at the centre of this cube. By dividing this cube into x segments at regular azimuth angles and y segments at regular elevation angles, it is possible to identify any angular range of the representation by reference to an x segment and a y segment (and then the space bracketed by this angular range will depend on both the angular resolution (e.g. the angle between adjacent brackets) and the distance of the point from the capture device).
Typically, each capture device has the same capture pattern so that the angular bracketing of each device is the same (albeit centred differently at the location of the relevant capture device). For example, in an embodiment with 1000 equal angular brackets, the angle for each bracket may be 360°/1000 (i.e. 0.36°).
In some embodiments, different capture devices are associated with different capture patterns, where this may be signalled in configuration information relating to the three-dimensional representation.
In some embodiments, each capture device is arranged to capture a point for a plurality of angular brackets, where each bracket is associated with a different angle. The angular spread of each bracket (that is, the angle between a first, e.g. left, angular boundary of the bracket and a second, e.g. right, angular boundary of the bracket) may be the same; equally, this angular spread may vary. In particular, the angular spread may vary so as to be smaller for points which are directly in front of (or behind, or to a side of) the capture device. For example, the embodiment shown in Figure 10 shows an angular bracketing system that is based on a cube. With this system, a cube is placed such that a capture device is located at the centre of the cube and the cube is then split into 1000 sections of equal size (it will be appreciated that the use of 1000 sections is exemplary and any number of sections may be used). Each of these sections is then
associated with an angular index. With this arrangement, the angular spread of each section (or bracket) varies, as has been described above.
Figure 9 shows a two-dimensional square, where each angular bracket of the square is referenced by an index number (between 1 and 100). In a three-dimensional implementation, an angular bracket of a cube could be indicated with two separate numbers (with a first azimuthal indicator that identifies a ‘column’ of the cube and a second elevational indicator that identifies a ‘row’ of the cube). Equally, a singular indicator may be provided that indicates a specific bracket of the cube. Therefore, for a cube that is divided into 1000 elevational sections and 1000 azimuthal sections, the bracket may be indicated with two separate indicators that are each between 0 and 999 or with a single indicator that is between 0 and 999999.
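As an illustrative sketch (assuming 1000 azimuthal sections and 0-based indicators), the two separate indicators and the single indicator can be converted into one another as follows:

    def bracket_to_single_index(azimuth_index, elevation_index, azimuth_sections=1000):
        # Combine separate azimuthal and elevational indicators (each
        # 0-999) into a single indicator between 0 and 999999.
        return elevation_index * azimuth_sections + azimuth_index

    def single_index_to_bracket(index, azimuth_sections=1000):
        # Recover the separate indicators from the single indicator.
        return index % azimuth_sections, index // azimuth_sections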
It will be appreciated that the use of a cube to define the brackets is exemplary and that other bracketing systems are possible. For example, a spherical bracketing system may be used (where this leads to curved angular brackets). Equally, a lookup table may be provided that relates angular indexes to angles, where this enables irregularly spaced brackets to be used.
Typically, determining the location of the point comprises determining the location of the point so as to be at the centre of the angular bracket identified by the angular identifier(s).
The display device 17 is arranged to display one or more two-dimensional images (hereafter termed one or more two-dimensional ‘immersive’ images) to a user in order to provide the impression of a three-dimensional scene. In particular, the display device may provide a first ‘immersive’ image for a first eye of a user and a second (different) ‘immersive’ image for a second eye of a user, with the differences between the images providing the impression of depth to a viewer of the images (‘immersive’ is used here as a label to distinguish the aforementioned displayed two-dimensional images from other two-dimensional images). The immersive images are typically arranged to be viewed using the display device 17 (e.g. a VR headset).
Typically, a computer device (e.g. the image generator 11 or the display device 17) is arranged to form the two-dimensional images based on a three-dimensional representation of the scene (e.g. based on a point cloud). Typically, the three-dimensional representation comprises a plurality of points, where each point has a location, an attribute value for a left eye, and an attribute value for a right eye. Based on a position of a viewer (e.g. in the viewing zone), the computer device is able to identify the points of the three-dimensional representation that are visible to the user and to form the two-dimensional immersive images for each eye based on the attribute values of these points and the locations of these points relative to the viewer.
Forming the two-dimensional immersive images in this way enables accurate images to be formed that provide an immersive scene to a user; however, this tends to require large amounts of computing power to identify, evaluate, and convert each of the relevant points in the representation. Therefore, it is desirable to identify methods that reduce the amount of computation necessary to form the two-dimensional immersive images from the three-dimensional representation.
In order to edit the two-dimensional immersive images that are provided to a viewer, an editor of the scene is typically able to: edit the three-dimensional representation (e.g. to change an attribute of a point); render an immersive two-dimensional image based on this three-dimensional representation; identify a change that they wish to make to the two-dimensional immersive image; modify the three-dimensional representation accordingly; and then render a further two-dimensional immersive image based on this modified three-dimensional representation to determine whether the modification has had the desired effect. This process (and the alternative methods described below) may be performed whether or not the three-dimensional representation includes a two-dimensional background image.
In practice, it can be difficult for an editor to determine which points of the three-dimensional representation need to be edited, and how those points need to be edited, to obtain a desired effect, meaning that the editing process can require lengthy trial and error.
Therefore, the present disclosure considers a method of partially rendering the two-dimensional immersive image in such a way that a user is able to evaluate the effect of modifications to the three-dimensional representation without the need to wholly re-render the two-dimensional immersive image.
Referring to Figure 10, to enable such an evaluation to occur, the method of forming the two-dimensional immersive image may comprise a three-step process in which, in a first step 51, a computer device (e.g. of the image generator 11) renders an intermediate two-dimensional immersive image; in a second step 52, the computer device identifies one or more rendering parameters; and in a third step 53, the computer device renders a final two-dimensional immersive image based on the intermediate image and the rendering parameters.
Typically, the rendering of the intermediate image comprises rendering a two-dimensional image based on the locations of the points in the three-dimensional representation. This rendering may also comprise rendering the two-dimensional image based on the attributes of the points in the three-dimensional representation.
Following this initial rendering step, the locations of the points in the three-dimensional representation may be fixed, with the attributes of these points still being modifiable. In practice, this may comprise maintaining a file that identifies the points of the three-dimensional representation that relate to the pixels or objects in the two-dimensional image.
A user is then able to edit the two-dimensional image, e.g. to modify a colour gamut of the image, to add shadows or effects, or to alter the lighting properties, where the effect of these edits on the various pixels of the two-dimensional image can be identified and related to changes in the attribute values of the three-dimensional representation. In some embodiments a final value of some attributes (e.g. the colours of points) can be defined by mixing and taking into account several other attributes linked to the same point or to neighbouring points, for example the specular and diffuse components of the points, or attributes splitting the influences of particular lights in the original 3D scene. For example, the computer device may identify an edit made to the two-dimensional immersive image and then identify a change that would be required to a point of the three-dimensional representation in order to effect this change in a future rendering of a two-dimensional immersive image. In a simple example, a user may decrease a brightness of the two-dimensional image by 10%, and the computer device may identify that each point of the three-dimensional representation should similarly be reduced in brightness by 10% so that a future two-dimensional immersive image (that is rendered based on the three-dimensional representation) will correspond to the edited two-dimensional image.
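A minimal sketch of propagating such an edit back to the point attributes is given below; the pixel-to-point correspondence structure and the 'brightness' attribute are assumptions of the sketch.

    def propagate_brightness_edit(points, pixel_to_point, edited_pixels, factor=0.9):
        # points: point attributes of the three-dimensional
        #   representation, keyed by point identifier
        # pixel_to_point: correspondence from pixels of the rendered
        #   two-dimensional image back to point identifiers
        # edited_pixels: pixels whose brightness the user reduced by 10%
        for pixel in edited_pixels:
            point_id = pixel_to_point.get(pixel)
            if point_id is not None:
                points[point_id]["brightness"] *= factor
        return points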
Where the three-dimensional representation contains, or is associated with, one or more two-dimensional background images, a first rendered image may be formed based on the three-dimensional representation and a second rendered image may be formed based on these background image(s). These rendered images may then be combined in the compositing step. Modifications may then be made to the rendered images separately. Typically, the rendering of the second rendered image (based on the background image) is much quicker than the rendering of the first rendered image and so this enables a user to modify and potentially re-render the second rendered image without needing to re-render the first rendered image.
The method described above enables a user to evaluate the effect of changes to the scene without entirely re-rendering a video of the scene (which video typically comprises a plurality of two-dimensional immersive images). By fixing the locations of the points and rendering the intermediate two-dimensional immersive image, a user is still able to make useful changes while limiting the amount of processing required as a
result of these changes (since the computer device does not need to re-identify the points of the three-dimensional representation that will be included in a two-dimensional immersive image).
The rendering parameters may comprise the attribute values of one or more points. The rendering parameters may, additionally or alternatively, comprise one or more two-dimensional effects, which two-dimensional effects may comprise filters or distortions that are applied to the intermediate two-dimensional immersive image (e.g. to change a brightness, contrast, or colour gamut of the intermediate two-dimensional immersive image).
Referring to Figure 11, there is described a detailed process for generating a two-dimensional immersive image based on a three-dimensional representation. This process is typically performed by a computer device (based on user inputs).
In a first ‘authoring’ step 61, the computer device prepares the scene files required to render the scene. This typically involves the computer device identifying the three-dimensional representations that will be used to form the two-dimensional immersive images.
The authoring step 61 generally involves the creation and assembly of content that is intended for final output, encompassing a broad range of activities from initial design and development to final adjustments before the content is rendered and composited. For example, the authoring step may involve generating and/or modifying a three-dimensional representation prior to the creation of one or more two-dimensional immersive images based on this three-dimensional representation.
In a second ‘rendering’ step 62, the computer device creates two-dimensional images based on the scene files, which images can be provided to, or edited by, a user. It will be appreciated that these two-dimensional images may be presented on a display in such a way that they form the impression of a three-dimensional scene when viewed by a user.
The rendering step 62 may comprise generating a plurality of different images. For example, this step may comprise generating a first image based on the points of a three-dimensional representation and a second image based on a background image that is referenced within this three-dimensional representation or that is selected by a user.
The rendering step 62 may involve calculating light, shadow, texture, and colour information to create individual image frames from the three-dimensional representation. For example, a user may provide an input that identifies a light source and the computer device may process the three-dimensional representation based on this light source (e.g. by modifying the attribute values of points in the three- dimensional representation based on the light source). The rendering step can be time-consuming and resource-intensive since it often requires complex calculations to simulate realistic lighting and materials.
In many situations, a user may wish to evaluate the effects of various filters or inputs on the three- dimensional representation in order to form the scene. Such an evaluation may require re-rendering images based on a variety of inputs; however, this can take a prohibitive amount of time, especially if a user is making a number of small tweaks to an input in order to perfect a rendered image.
In a third ‘compositing’ step 63, the computer device combines the rendered images. For example, the computer device may impose a foreground image (e.g. formed from a three-dimensional representation) onto a background image (e.g. a two-dimensional background image).
The compositing step 63 typically involves combining a plurality of layers, where this combining may involve modifying the layers so as to blend the layers together. For example, a user may be able to define a contrast between various layers or may be able to alter a brightness or colour gamut of a layer in order to generate a final two-dimensional immersive image in which the layers are blended together in a desired way.
The compositing step 63 typically involves compositing images formed from volumetric (e.g. three-dimensional) representations. Such compositing is different from film compositing that only considers two-dimensional images. In this regard, instead of simply layering a plurality of two-dimensional images, the compositing step 63 of the method of Figure 11 typically comprises integrating and aligning multiple three-dimensional elements within a virtual space. This may involve combining three-dimensional representations of real-world objects, CGI elements, and other data sources to create a unified 3D environment. Compositing in this context may involve spatial adjustments, ensuring that objects from different captures or renders correctly interact in terms of scale, lighting, and positional accuracy. Typically, this involves forming two-dimensional objects (e.g. pixels) based on three-dimensional representations and then combining these two-dimensional objects based on three-dimensional location information associated with the three-dimensional objects. More specifically, this may comprise generating a plurality of two-dimensional layers, where each layer is associated with a different depth in the three-dimensional representation, and then combining the layers so as to provide a two-dimensional image that can represent a three-dimensional scene.
The relationship between the rendering step 62 and the compositing step 63 is typically sequential. That is, the method typically comprises rendering individual elements (e.g. characters, backgrounds, visual effects) in the rendering step and then, following this rendering step, combining the rendered elements in the compositing step. In the compositing step, the rendered elements may be layered and adjusted to ensure that they interact convincingly with each other. Compositing allows for the integration of various visual elements into one seamless visual output, adjusting for factors like depth, colour balance, and interaction of light among different components.
In many situations, the user may desire to modify the three-dimensional representation, for example to alter a lighting of the three-dimensional representation or to alter a colour of a point of the three-dimensional representation.
Not least because conventional image editing software is mostly arranged to operate on two-dimensional images, a user might wish to view and/or modify two-dimensional images that are generated based on the three-dimensional representation in such a way that any modifications made to the two-dimensional images cause modifications to the three-dimensional representation. For example, if a user edits a colour or a position of a visual element in one of the aforementioned composite images and decides that they prefer the new colour of the element, they may wish to similarly edit the colour of any associated points in the three-dimensional representation so that a future rendering of the three-dimensional representation will cause the element to appear with the edited colour. In particular, a modifier (e.g. a compositor) of a scene may wish to edit the three-dimensional representation before this three-dimensional representation is transmitted to viewers (which viewers will then render the three-dimensional representation for viewing).
As mentioned above, the compositor may wish to perform this modification on a rendered and/or composited image, since a two-dimensional image is easier to view and modify than the three-dimensional representation, but the compositor may then wish to modify the three-dimensional representation. This may require the compositor to: identify a desired change based on a composite image; enact a corresponding modification in the three-dimensional representation; form a new composite image based on the updated three-dimensional representation; and then ensure that the enacted modification has had the desired effect. That is, conventionally, when a compositor views a composite image and determines that modifications are required, the compositor may be required to re-start the process of Figure 11 and to make modifications to a three-dimensional representation during a new authoring step; since this process then involves re-rendering images based on the three-dimensional representation, this process can be lengthy, unwieldy, and unpredictable (since it might be difficult for the compositor to determine which modifications they must make to the three-dimensional representation to achieve a desired effect in the two-dimensional images).
Therefore, the present disclosure describes methods by which a three-dimensional representation can be more efficiently modified. In particular, the present disclosure describes methods by which modifications may be made to a two-dimensional image so as to cause a change in a three-dimensional representation, where the effect of these modifications can be viewed without the need to (completely) regenerate the three-dimensional representation from the initial scene. That is, the present disclosure describes methods by which a user can edit properties of points in the three-dimensional representation without needing to update an initial model on which the three-dimensional representation is determined. In a practical example, the three-dimensional representation may be captured from a 3D model of the scene that has a certain arrangement of lighting. In order to present a similar scene with a different arrangement of lighting, a user could edit the lighting in this 3D model and then re-generate the three-dimensional representation (using the capture process described above). However, this would typically take a large amount of time and processing power. Using the methods set out herein, the user is instead able to edit the lighting in the scene using post-processing software that considers a two-dimensional image. This modification process enables a user to modify the points of the three-dimensional representation, so that a complete re-generation of the three-dimensional representation is not necessary.
Of relevance to these methods, the compositing step 63 (or more generally a process of editing a two-dimensional image) may involve the use of arbitrary output values (AOVs), where a pixel of a rendered image may be associated with one or more AOVs. Each AOV can contain different data about the scene, such as a diffuse colour, specular highlights, shadows, reflections, etc., so that by using these AOVs a user is able to cause a change in the two-dimensional image.
The method of Figure 11 may comprise associating one or more pixels of a rendered or composite image with such an arbitrary output value. This typically occurs following the second rendering step 62, where pixels from one or more rendered images are then associated with one or more AOVs. The third compositing step 63 is then dependent on the values of these AOVs.
These AOVs may be saved separately from a rendered or composite image and may be arranged to be adjusted individually without re-rendering the images that are used to form the composite image. This is particularly useful in complex scenes where tweaking individual elements precisely is necessary for achieving the desired final image.
By providing AOVs, e.g. by exporting different aspects of the lighting and material responses as separate AOVs, a compositor is able to fine-tune the appearance of each visual element of a two-dimensional image independently; such fine-tuning may make use of conventional post-production software such as Nuke™ or After Effects™. For example, using such software a user is able to adjust the intensity of reflections or to correct shadows in one part of an image without affecting other parts of the image.
Furthermore, AOVs help in troubleshooting and quality control by enabling an editor to isolate specific parts of a rendered image. Therefore, if a compositor identifies an issue with how light interacts with the surface of an object, this compositor can use a relevant AOV to analyse and correct this problem without needing to modify the original 3D model and to re-generate the three-dimensional representation.
In short, AOVs enable a user to modify a rendered image efficiently and in detail (e.g. focusing on a particular part of the image) and to view the effects of these modifications without needing to modify the original 3D model and to re-generate the three-dimensional representation.
In various embodiments, a computer device might provide separate AOVs for one or more points of a three-dimensional representation or one or more pixels of a rendered two-dimensional image, the AOVs providing information that defines one or more of: diffuse (the basic colour of surfaces without reflections or lighting); specular (the reflective highlights from surfaces); normals (information about surface angles, useful for relighting or adjusting effects based on the orientation of surfaces); and depth (how far elements are from the camera, which may be used for depth-of-field effects or atmospheric effects).
When providing the final composite (immersive) two-dimensional image, the various AOVs may be combined in order to form the final image that appears in, e.g. a film or a game. Therefore, a user is able to separately combine AOVs in order to modify specific features of an image, and these AOVs can then be combined to form a final image.
Referring to Figure 12, in some embodiments, the second rendering step 62 comprises rendering 71 one or more two-dimensional layers (e.g. images) from a three-dimensional representation and associating 72 pixels from one or more of these layers with AOVs. In particular, a plurality of pixels (or layers) may be generated based on different sets of points in the three-dimensional scene, which sets of points are each associated with a different distance from a viewing zone.
For example, a first layer may be generated based on points that are in the range of 10 m to 15 m from the viewing zone, a second layer may be generated based on points that are in the range of 15 m to 25 m from the viewing zone, etc. It will be appreciated that these distances are purely exemplary. The range of distances associated with each layer may depend on a distance of the layer from the viewing zone (e.g. where layers associated with points far from the viewing zone are associated with larger distance ranges than layers associated with points closer to the viewing zone).
A user is then able to separately combine and/or modify 73 the AOVs for these various layers so that, for example, a user can edit the appearance of objects close to the viewing zone without affecting the appearance of objects further from the viewing zone. The computer device can then identify 74 a layer that is associated with the combined AOVs and can identify the points of the three-dimensional representation that are associated with this layer.
The computer device may then modify 75 the one or more identified points of the three-dimensional representation based on the modifications made to the rendered layers. For example, if a characteristic (e.g. a colour) of a first layer is modified by a user, then the computer device may be arranged to identify one or more points associated with the first layer and to modify the attributes of the identified points based on the modifications made to the rendered layers.
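The sketch below illustrates, under assumed data structures, how points might be assigned to distance-based layers and how a modification made to one layer could then be applied to the corresponding points; the distance ranges and attribute names are illustrative only.

    def assign_points_to_layers(points, layer_ranges):
        # points: dicts each holding a 'distance' from the viewing zone
        # layer_ranges: e.g. [(0, 10), (10, 15), (15, 25)] in metres
        layers = [[] for _ in layer_ranges]
        for i, point in enumerate(points):
            for layer, (near, far) in enumerate(layer_ranges):
                if near <= point["distance"] < far:
                    layers[layer].append(i)
                    break
        return layers

    def modify_layer_points(points, layer_indices, attribute, new_value):
        # Apply a modification made to a rendered layer back to the
        # points of the three-dimensional representation in that layer.
        for i in layer_indices:
            points[i][attribute] = new_value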
One potential issue with this method of modification is that it can be difficult to modify the three-dimensional representation based on the modified layer. In this regard, once the layers are rendered, they may be functionally separate from the three-dimensional representation. Therefore, implementing the above-described method may (but does not necessarily) involve storing correspondences between the rendered layers and points of the three-dimensional representation that are associated with each layer so that modifications made to a layer can be used to modify the three-dimensional representation.
Therefore, in some embodiments, the computer device is arranged to render one or more two-dimensional objects (e.g. pixels) based on points of the three-dimensional representation; to store a correspondence between the two-dimensional objects and the points; to associate one or more AOVs with the two-dimensional objects; to identify a modification to one or more of the AOVs; and, based on the correspondences and the identified modifications, to modify the corresponding points of the three-dimensional representation.
This method may be considered to involve generating a ‘proxy’ two-dimensional image based on a three-dimensional representation. The proxy image may, for example, comprise a 360 degree image of the three-dimensional representation. A user is then able to edit the proxy image using two-dimensional image processing software (e.g. by altering an AOV associated with a point of the image).
The three-dimensional representation can then be updated based on these edits. In particular, a correspondence between points of the three-dimensional representation and the pixels of the two-dimensional image is determined at the time of generation of the two-dimensional image. This correspondence is then used to identify a point of the three-dimensional representation that is associated
with each edited pixel of the two-dimensional image. The points of the three-dimensional representation can then be modified based on edits made to the pixels of the two-dimensional proxy image.
Typically, this involves a compositor making a plurality of edits to the proxy image and then exporting these edits to the three-dimensional representation. That is, the compositor is not directly editing the points of the three-dimensional representation; instead, they are working on proxy data so that any edits made to the proxy data must be transformed (via the determined and stored correspondence) in order to apply these edits to the three-dimensional representation.
Referring to Figure 13, there is described another method by which a three-dimensional representation can be modified using compositing software. This process is typically performed by a computer device (based on user inputs). The method of Figure 13 enables a user to work directly (or near-directly) on the point data of the three-dimensional representation.
In a first step 81, points from the three-dimensional representation are arranged in a two-dimensional array (e.g. where each point or ‘entry’ of the array is associated with a point of the three-dimensional representation). The entries of the array may each include one or more of: an indicator of a capture device used to capture the point; location information (e.g. in x, y, z coordinates or with reference to the capture device); a size; attribute values; a normal; a transparency; etc., where these properties are determined from a corresponding point of the three-dimensional representation.

In a second step 82, the entries of the array (and so, indirectly, the points of the three-dimensional representation) are each associated with one or more AOVs. In particular, the entries of the array may each be associated with one or more AOVs that indicate attribute data of the points. The aforementioned point information may be associated with the two-dimensional array via these AOVs. Therefore, for example, one or more of: an indicator of a capture device used to capture the point; location information (e.g. in x, y, z coordinates or with reference to the capture device); a size; attribute values; a normal; and a transparency of each point may be indicated by an associated AOV. In this way, information relating to the three-dimensional representation can be effectively signalled in a two-dimensional image.
In particular, the computer device may generate one or more AOVs that relate to one or more components of point data associated with the points of the three-dimensional representation. The point data typically comprises a plurality of fields (e.g. a normal field, a left eye attribute field, a right eye attribute field, etc.) and an AOV may be generated for one or more of, or each of, the fields of the point data. Therefore, the AOV provides a direct link to the point data of the three-dimensional representation. A user is able to modify the attributes of the two-dimensional array representation using (e.g. conventional) compositing software. As described below, this two-dimensional array can then be “transformed” (quickly) into a 360 degree image and/or a two-dimensional image so that the user is able to quickly visualize the effect of these modifications.
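A minimal sketch of the first and second steps 81, 82 is given below (Python/NumPy; the field names, the choice of AOV planes, and the function name `pack_points_into_array` are assumptions for illustration only). Each point becomes one entry of a two-dimensional array holding a base value, while its location, normal, and capture-device identifier are carried in associated ‘technical’ AOV planes.

```python
import numpy as np

def pack_points_into_array(points, width):
    """Pack point data into a 2D array plus per-entry 'technical' AOV planes.
    Each entry holds a base colour; location, normal, etc. live in the AOVs."""
    height = -(-len(points) // width)                        # ceiling division
    base = np.zeros((height, width, 3), dtype=np.float32)   # colour per entry
    aov_location = np.zeros((height, width, 3), dtype=np.float32)
    aov_normal = np.zeros((height, width, 3), dtype=np.float32)
    aov_device = np.zeros((height, width), dtype=np.int32)  # capture device id
    for idx, point in enumerate(points):
        y, x = divmod(idx, width)
        base[y, x] = point["colour"]
        aov_location[y, x] = point["location"]
        aov_normal[y, x] = point["normal"]
        aov_device[y, x] = point["device_id"]
    aovs = {"location": aov_location, "normal": aov_normal, "device": aov_device}
    return base, aovs
```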
In a third step 83, a (e.g. 360 degree and/or two-dimensional) image is generated (e.g. rendered) based on the array and the technical AOVs. Such an image can then be viewed by a user using existing processing software that is designed for two-dimensional images. This rendering of the image comprises transforming the entries of the two-dimensional array in order to generate a coherent image that can be displayed to the user.
As described in more detail below, the user is then able to modify the attributes of the two-dimensional array and/or the three-dimensional representation and to (almost immediately) view the effects of these updates in the form of updates on the two-dimensional image. This could be considered to be a user indirectly modifying the two-dimensional image, where the user may be able to use editing software that is designed for use with two-dimensional images, to view and interact with a two-dimensional image using this software, to make modifications to entries of the two-dimensional array using this editing software (where this may involve a user selecting an entry for editing by interacting with the two-dimensional image)
and to then, in real-time, view the effect of these modifications by viewing a change that occurs to the two-dimensional image.
More specifically, in the second step 82, the computer device associates entries of the two-dimensional array (and thus, indirectly, points of the three-dimensional representation) with a first set of AOVs (referred to herein as ‘technical’ AOVs), where these technical AOVs contain information about the properties of the points, e.g. the locations of the points and the normals of the points.
This typically comprises forming the two-dimensional array based on the three-dimensional points, where each point is associated with an entry in the array and each entry is associated with one or more technical AOVs (which technical AOVs contain information relating to the three-dimensional points). This two-dimensional array may then be used as an input to two-dimensional editing software by providing the array as an input ‘image’. This two-dimensional array typically does not form a coherent image, so that while the array may be input to the two-dimensional image software to enable a user to modify the array, the two-dimensional array is typically not able to be directly displayed as a coherent image.
In other words, the entries of the array may be arranged into a two-dimensional image that shows, e.g., a colour of each entry, but these entries would typically not be arranged to be positioned in any meaningful way. The information that enables the proper positioning of these points is stored in the technical AOVs (e.g. the locations of the points and/or the distances of the points from a capture device) as opposed to being present in the two-dimensional image.
The computer device can then transform the entries of the two-dimensional array based on the technical AOVs in order to present a coherent two-dimensional image to a user.
Conventionally, AOVs contain information about modifications to the colours of points. The disclosure herein considers the use of AOVs to encode location information where, as will be described further below, the combination of the two-dimensional array and the technical AOVs that define location information may be used to generate a coherent two-dimensional image that is shown to the user. The two-dimensional image is generated using location information defined by the technical AOVs.
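The following sketch illustrates one possible such transformation (an equirectangular, 360-degree projection is assumed purely for illustration; the function name `visualise` and the AOV layout follow the packing sketch above). The location AOV of each entry, rather than the entry's position in the array, determines where the entry appears in the coherent two-dimensional image.

```python
import numpy as np

def visualise(base, aovs, out_w=1024, out_h=512, centre=np.zeros(3)):
    """Transform the (incoherent) 2D array into a coherent 360-degree image
    by placing each entry according to its location AOV (equirectangular)."""
    image = np.zeros((out_h, out_w, 3), dtype=np.float32)
    depth = np.full((out_h, out_w), np.inf, dtype=np.float32)  # keep nearest entry per pixel
    locations = aovs["location"].reshape(-1, 3) - centre
    colours = base.reshape(-1, 3)
    dists = np.linalg.norm(locations, axis=1)
    valid = dists > 0
    lon = np.arctan2(locations[:, 0], locations[:, 2])          # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(locations[:, 1] / np.maximum(dists, 1e-9), -1, 1))
    px = ((lon + np.pi) / (2 * np.pi) * (out_w - 1)).astype(int)
    py = ((lat + np.pi / 2) / np.pi * (out_h - 1)).astype(int)
    for i in np.nonzero(valid)[0]:
        if dists[i] < depth[py[i], px[i]]:                      # simple depth test
            depth[py[i], px[i]] = dists[i]
            image[py[i], px[i]] = colours[i]
    return image
```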
In some embodiments, the user is able to edit the technical AOVs (e.g. in order to reposition a point). In some embodiments, the technical AOVs are arranged so that they cannot be edited. Therefore, a user may be able to edit a colour of the points but not to edit a location of the points.
Instead, the user may be able to generate or edit (e.g. parameters of) one or more of a second set of AOVs (hereafter referred to as ‘editable’ or ‘standard compositing’ AOVs) and/or a user may be able to edit a colour of the entries in the array. For example, the standard compositing AOVs may relate to a diffuse or a specular value associated with an entry of the array.
With this arrangement, the technical AOVs can be used to correctly position (e.g. order) the entries of the array so as to provide a coherent two-dimensional image (e.g. a visualisation image) to a user (before an image is output by processing software). By performing this transformation of the entries of the array at this late stage, e.g. after the editing of any points, the arrangement of Figure 13 enables the user to edit the point data and/or the entries of the array and/or the “standard compositing” AOVs in order to directly modify the point data of the three-dimensional representation.
A reason for providing location information in the technical AOVs is that it enables three-dimensional location information to effectively be encoded in a two-dimensional data format. Transforming the three-dimensional points into an accurate two-dimensional visualisation image requires a transformation to occur, so that if a transformation module is positioned before an editing module then this editing module will necessarily be working on transformed data and not on the three-dimensional point data. By using the technical AOVs and locating a transformation module after an editing module, a user is able to work directly on three-dimensional point data while still being shown an accurate two-dimensional image.
In this regard, in a fourth step 84, the computer device identifies one or more modifications to the point data (and/or the entries of the two-dimensional array).
In a fifth step 85, the computer device re-renders the image (e.g. re-transforms the entries of the two-dimensional array) based on the modification and the technical AOVs. This step may occur following a user input (e.g. to confirm a change made to a rendered image).
With this method, a user is able to directly modify the points of the three-dimensional representation by modifying a two-dimensional array that is transformed for the user into a two-dimensional visualisation image so that the user can readily identify the result of their modifications. That is, since there is a direct correspondence between the technical AOVs and the fields of the points of the three-dimensional representation that define the positions of these points in the three-dimensional space, the computer device is able to transform the entries of the array into a two-dimensional visualisation image such that any modification that is made to the entries is readily associated with points of the two-dimensional image (and vice versa).
To visualise this arrangement, we refer to the system of Figure 14.
This system comprises a reader module 91 that receives point data of the three-dimensional representation. This reader 91 is arranged to parse the point data in order to determine the one or more technical AOVs. In particular, the reader may generate one or more technical AOVs based on components of the point data, such as location data, capture device identifier data, normal data, and/or distance data (the distance indicating a distance of a point from a capture device or a viewing zone). These technical AOVs are then transferred to a technical AOV storage module 92.
The reader module 91 also generates (or interprets) a two-dimensional array (e.g. an array of entries or pixels), where each entry in the array is associated with a set of one or more technical AOVs. Essentially, a three-dimensional point is stored (or interpreted) in the form of a single value in the array (e.g. a colour value) and a set of technical AOVs. The three-dimensional point may also be associated with a set of ‘editable’ or ‘standard compositing AOVs’ that define, e.g., diffuse, specular, etc. for the entries of the array.
The technical AOVs are then transferred from this technical AOV storage module 92 to a transformation module 93, which transformation module uses the technical AOVs to render a two-dimensional visualisation image based on the two-dimensional array.
More specifically, the transformation module 93 is arranged to position the points of the two-dimensional array based on the technical AOVs in order to form the two-dimensional visualisation image of a scene. This two-dimensional visualisation image can then be shown to a user.
To enable a user to modify the point data, the system comprises an editing module 94. This editing module may comprise functions of two-dimensional image editing software (e.g. Nuke™). A user is then able to edit the point data (e.g. the entries in the array) directly using the editing module. This editing may comprise the use of a second set of “standard compositing” AOVs (which second set of AOVs is separate to the first, technical, set of AOVs) and may involve editing the values of the entries in the array.
The editing module is arranged to detect each modification that is made using the editing module and to cause the re-transformation of the two-dimensional visualisation image (via the transformation module 93) based on these modifications. Typically, the image is re-transformed following each modification. Equally, the image may be re-transformed periodically and/or following a number of modifications. The image may be re-transformed based on a user input.
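A minimal sketch of this edit-and-re-transform loop is given below (assuming the `visualise` transformation sketched above; the edit format, a mapping from array coordinates to new colours, is an assumption for illustration). The edit is applied directly to the array entries (i.e. the point data), and the visualisation image is simply regenerated from the technical AOVs.

```python
def on_edit(base, aovs, edits, visualise):
    """Apply a batch of edits to the 2D array entries and re-transform the
    visualisation image so the user immediately sees the result."""
    for (y, x), new_colour in edits.items():
        base[y, x] = new_colour          # edit the array entry (i.e. the point data)
    return visualise(base, aovs)         # re-render from the technical AOVs
```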
Therefore, a user modifies the point data directly, with these modifications being interpreted (via the transformation module 93) to present a coherent two-dimensional visualisation image to the user. Importantly, this enables point data of the three-dimensional representation to be modified using two-dimensional image processing software so as to enable efficient modification of three-dimensional data. Specifically, the input to the editing module 94 is a two-dimensional array (e.g. a two-dimensional image), where the entries of this array (e.g. the pixels of the image) are associated with AOVs. This is a typical format for data in two-dimensional image editing software. The use of the technical AOVs effectively enables the system to encode, in this conventional two-dimensional format, information that enables the correct positioning of three-dimensional points.
As described above, the two-dimensional array that is passed to the editing module 94 may be determined by parsing points of the three-dimensional representation and this may result in a non-representative or incoherent two-dimensional image. In this regard, the location, transparency, etc. of the points are typically stored as the technical AOVs, so the pixels of the two-dimensional image formed by the entries of the array may essentially be a random (or at least non-meaningful) arrangement of values. In a practical example, the points may be grouped based on the capture device used to capture these points and so the pixels may be grouped similarly. Due to this, the two-dimensional array formed in the first step 81 is typically not visually meaningful to a user.
Therefore, to enable a user to evaluate and modify the two-dimensional array, the method may include transforming the two-dimensional array (using the transformation module 93) based on the technical AOV values of the entries in the array in order to provide a visually-accurate rendering of the scene to a user (e.g. using an X-Y transformation projection).
In other words, the computer device may be arranged to determine a transformation of the two-dimensional array based on the AOV values of the entries of the array, e.g. the locations and attributes of the entries that are indicated by the AOV values, in order to provide a two-dimensional representation (the two-dimensional visualisation image) of the scene to a user.
This method typically requires some up-front processing, where a suitable set of technical AOVs (and/or a suitable transformation) must be generated that translates the pixels in the two-dimensional array into a visually meaningful image based on the technical AOV values of those pixels (which AOV values identify locations and attribute values of the pixels). Following this up-front processing, a user is able to work directly (or near-directly) on the points from the three-dimensional representation and to directly modify the three-dimensional points.
As described above, a user is typically able to modify (e.g. manipulate and define) the final colour of the image using “standard compositing” AOVs associated with the entries of the two-dimensional array in order to enact a desired change to an image, and the computer device is then able to modify point data of a corresponding point of the three-dimensional representation based on a correspondence between the entries of the array and the standard compositing AOVs and the points of the three-dimensional representation.
In a practical example, where a user wishes to edit the colour of a point, they may directly edit this colour (e.g. in the three-dimensional representation) or they may define a new colour for the point based on the “standard compositing” AOVs of this point and the computer device may then modify the attribute data of this point. Equally, the user may edit the attribute data (e.g. the colour data) of the point directly, and the computer device may then update the corresponding entry in the array based on the edited point data in order to update an image being viewed by a user. Furthermore, while modifications have primarily been described with reference to modifying point colour, a user may equally be able to modify the technical AOVs as well as the “standard compositing” AOVs.
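The following sketch illustrates one possible way of combining “standard compositing” AOVs into a final colour and writing that colour back to the corresponding point. The diffuse-plus-specular combination rule, the `index_of` correspondence array, and the field names are assumptions for illustration only, not the combination used by any particular compositing package.

```python
def apply_compositing_aovs(points, base, comp_aovs, index_of):
    """Combine editable compositing AOVs into a final colour per entry and
    write the result back to the corresponding point of the 3D representation."""
    h, w, _ = base.shape
    for y in range(h):
        for x in range(w):
            idx = index_of[y, x]                  # entry -> point correspondence
            if idx < 0:
                continue
            final = base[y, x] * comp_aovs["diffuse"][y, x] + comp_aovs["specular"][y, x]
            points[idx]["colour"] = final         # modify the point's attribute data
    return points
```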
Such embodiments enable a user to work directly on the point data of the three-dimensional representation, with the modifications being reflected (via the transformations) into a meaningful image in near-real time.
Formation of a bitstream
The processing of a three-dimensional representation that has been described above results in a processed three-dimensional representation that may then be stored and/or transmitted by a computer device. In
particular, the processed three-dimensional representation may be encoded in a bitstream that is transmitted to another device and/or the processed three-dimensional representation may be used to render one or more two-dimensional images, which images may be encoded in a bitstream that is transmitted to another device. This bitstream can then be decoded by this other device in order to extract the processed three-dimensional representation and/or the two-dimensional image(s) from the bitstream.
The present disclosure envisages a bitstream that contains or references a background video, where this background video is associated with one or more background points of the three-dimensional representation. For example, the bitstream may comprise a first section that defines a plurality of points of a three-dimensional representation (including one or more background points) and a second section that defines, or references, one or more two-dimensional background images that are associated with these points and/or that reference video data associated with these points.
Figure 15 shows a schematic of such a bitstream comprising two sections, where each section comprises one or more bits.
Bit-a to Bit-d forms a first section of the bitstream that signals one or more points of a three-dimensional representation and/or that defines one or more immersive two-dimensional images.
Bit-e to Bit-f forms a second section of the bitstream that references or defines one or more two-dimensional background images (or videos) that are referenced by the points of the three-dimensional representation.
These two sections of the bitstream may be differently encoded or decoded. For example, the second section may be encoded/decoded using traditional two-dimensional video coding techniques (such as HEVC, VVC, and/or LCEVC techniques), while the first section may be encoded/decoded using a three-dimensional encoding/decoding technique.
While the bitstream is typically encoded in the order of the sections provided above, the bitstream may be encoded in any order. In some embodiments bits from the first section and the second section may be ‘interlaced’, where any background point that is signalled in the bitstream is immediately followed by a corresponding background video.
The bitstream described above may be decoded by a decoding device and this may allow the original (or similar to the original) plurality of images to be re-generated. A method of decoding said bitstream may comprise the steps of: identifying a first and a second section of bits in a bitstream; generating a plurality of initial two-dimensional immersive images based on the bits in the first section of the bitstream; generating one or more two-dimensional background images based on the bits in the second section of the bitstream; and combining the initial immersive and background images to form one or more final two-dimensional immersive images.
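By way of illustration only, the decoding steps listed above might be arranged as in the following sketch (Python; the four-byte length prefix per section and the decoder callables `decode_points`, `decode_background`, and `combine` are assumptions introduced purely for illustration, not a defined bitstream syntax).

```python
import struct

def decode_bitstream(data, decode_points, decode_background, combine):
    """Split a bitstream into its two sections (each prefixed here by a 4-byte
    big-endian length), decode them separately, and combine the results into
    final two-dimensional immersive images."""
    offset = 0
    sections = []
    for _ in range(2):                                  # first: points/immersive, second: background
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        sections.append(data[offset:offset + length])
        offset += length
    initial_images = decode_points(sections[0])         # e.g. a 3D decoding technique
    background_images = decode_background(sections[1])  # e.g. an HEVC/VVC/LCEVC decoder
    return [combine(img, background_images) for img in initial_images]
```

Because the two sections are decoded by independent callables, the same sketch also admits the parallelised decoding described below.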
In some embodiments, the aforementioned sections of a bitstream are arranged to be decoded separately (e.g. by separate computer devices or processing units), where this enables a parallelised method of decoding the bitstream so as to speed up a process of decoding and rendering a scene. For example, each pair of background points and background images may be decoded and processed separately.
In some embodiments, the bitstream comprises one or more flags that indicate features of the bitstream and/or of the three-dimensional representations or two-dimensional images signalled by the bitstream. For example, the bitstream may comprise one or more flags that indicate: whether the three-dimensional representation contains any background points (e.g. references any two-dimensional background images); a feature, e.g. a resolution or a size, of the two-dimensional background images; a location of a repository that contains the two-dimensional images referenced by the background points (e.g. the repository may be the second section of the bitstream or may be a separate repository); and a process by which the two-dimensional background images should be combined with the initial two-dimensional immersive images (e.g. whether they should be provided behind the other points of the immersive images and/or whether they should be presented in layers within the immersive images).
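Such flags might, purely as an illustration, be packed into a single header byte and interpreted as in the sketch below; the bit assignments shown are invented for this example and do not reflect any defined syntax.

```python
def parse_header_flags(flag_byte):
    """Interpret a single header byte whose bits indicate bitstream features
    (bit positions here are purely illustrative)."""
    return {
        "has_background_points": bool(flag_byte & 0x01),  # any 2D background images referenced
        "external_repository":  bool(flag_byte & 0x02),   # images stored outside the bitstream
        "layered_composition":  bool(flag_byte & 0x04),   # present backgrounds as layers
    }
```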
Alternatives and modifications
It will be understood that the present invention has been described above purely by way of example, and modifications of detail can be made within the scope of the invention.
The representation is typically arranged to provide an extended reality (XR) experience (e.g. a representation that is useable to render an XR video). The term extended reality (XR) covers each of virtual reality (VR), augmented reality (AR), and mixed reality (MR), and it will be appreciated that the disclosures herein are applicable to any of these technologies.
The representation may be encoded into, and/or transmitted using, a bitstream, which bitstream typically comprises point data for one or more points of the three-dimensional representation. The point data may be compressed or encoded to form the bitstream. The bitstream may then be transmitted between devices before being decoded at a receiving device so that this receiving device can determine the point data and reform the three-dimensional representation (or form one or more two-dimensional images based on this three-dimensional representation). In particular, the encoder 13 may be arranged to encode (e.g. one or more points of) the three-dimensional representation in order to form the bitstream and the decoder 14 may be arranged to decode the bitstream to generate the one or more two-dimensional images.
In some embodiments, the scene comprises a static scene; alternatively, in some embodiments the scene comprises a video and/or a moving (e.g. non-static) scene. That is, in some embodiments the scene comprises a static scene, such as a building, where a viewer is able to move through this scene, e.g. to view different rooms of the building, but where the scene itself does not change. In some embodiments, the scene comprises a moving scene, where elements of the scene vary in time even where the viewer remains stationary. It will be appreciated that typically the scene comprises both static and moving elements where, for example, non-static elements move in front of a static background.
Reference numerals appearing in the claims are by way of illustration only and shall have no limiting effect on the scope of the claims.
Claims
1. A method of processing a three-dimensional representation of a scene, the method comprising: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.

2. The method of claim 1, comprising identifying a modification during a compositing stage.
3. The method of any preceding claim, comprising: for one or more of the points of the three-dimensional representation, generating a first set of arbitrary output values (AOVs); and rendering the one or more two-dimensional objects based on the first set of AOVs.
4. The method of claim 3, wherein the first set of AOVs defines location information of the two-dimensional objects, preferably wherein the first set of AOVs is defined based on location information of the one or more points of the three-dimensional representation, more preferably wherein the first set of AOVs comprises one or more of: an AOV defining a normal value of a corresponding point; an AOV defining a capture device associated with a corresponding point; and an AOV defining a distance of a corresponding point from a capture device.
5. The method of claim 3 or 4, wherein the first set of AOVs is non-modifiable.
6. The method of any preceding claim, wherein identifying a modification of the objects comprises identifying a modification of a colour of an object and/or identifying a modification of an AOV associated with an object and/or a point, preferably wherein the modified AOV is an editable AOV.
7. The method of claim 6, wherein the one or more points are associated with: a first set of AOVs that defines a location of the point, preferably wherein the first set of AOVs is non-modifiable; and a second set of AOVs that defines a colour of the point, preferably wherein the second set of AOVs is modifiable.
8. The method of any preceding claim, comprising: determining that a modification has been made to one or more of the objects; and re-rendering the two-dimensional objects based on the modification; preferably, comprising re-rendering the two-dimensional objects based on the first set of AOVs for the one or more points of the three-dimensional representation.
9. The method of any preceding claim, comprising: associating one or more arbitrary output values (AOVs) with the two-dimensional objects; identifying a modification of one or more of the AOVs; and identifying the two-dimensional objects that are associated with the modified AOVs; preferably, the method comprises modifying the identified points, more preferably, the method comprises modifying the identified points based on the modifications to the AOVs.
10. The method of any preceding claim, comprising: storing, preferably at the time of rendering the two-dimensional object, a correspondence between the two-dimensional objects and corresponding points of the three-dimensional representation; wherein identifying the points of the three-dimensional representation comprises identifying the points based on the correspondences.
11. The method of any preceding claim, comprising: determining one or more datafields associated with the points; and generating one or more AOVs based on the values of the datafields; preferably, wherein the datafields define a location and/or one or more attributes of the points.
12. The method of any preceding claim, comprising: generating a two-dimensional array based on the points, wherein each entry of the two-dimensional array is associated with a point of the three-dimensional representation; and associating each entry of the array with one or more AOVs, the AOVs representing attributes of the point associated with said entry; preferably wherein: the AOVs indicate a location of each point in the three-dimensional representation; and/or the AOVs indicate an attribute value of each point, preferably wherein the AOVs indicate one or more of: a normal; a transparency; a colour; a left eye attribute value; and a right eye attribute value.
13. The method of any preceding claim, comprising determining a transformation that converts the two-dimensional array into a two-dimensional image that represents the scene, preferably wherein the transformation is determined based on the values of the AOVs associated with each entry.
14. The method of any preceding claim, wherein rendering the objects comprises: determining one or more scene files, wherein at least one scene file comprises a three-dimensional representation of the scene; and rendering one or more two-dimensional objects based on the scene files.
15. The method of any preceding claim, comprising: compositing the rendered two-dimensional objects to form a two-dimensional immersive image; and/or superimposing a two-dimensional object rendered based on the three-dimensional representation onto a two-dimensional background image.
16. The method of any preceding claim, comprising rendering a plurality of layers associated with the three-dimensional representation, wherein each layer comprises a two-dimensional image and wherein each layer is associated with one or more points of the three-dimensional representation that are at similar distances from a viewing zone of the three-dimensional representation, preferably wherein each layer is associated with a respective set of AOVs.
17. The method of any preceding claim, wherein the modification relates to one or more of: an attribute value; a colour; a location; a normal; and a transparency.
18. The method of any preceding claim, wherein the three-dimensional representation is associated with a viewing zone, the viewing zone comprising a subset of the scene and/or the viewing zone enabling a user to move through a subset of the scene, preferably wherein: the user is able to move within the viewing zone with six degrees of freedom (6DoF); and/or
the viewing zone has a volume of less than 50% of the volume of the scene, less than 20% of the volume of the scene, and/or less than 10% of the volume of the scene; and/or the viewing zone has, or is associated with, a volume, preferably a real-world volume, of less than five cubic metres (5 m³), less than one cubic metre (1 m³), less than one-tenth of a cubic metre (0.1 m³) and/or less than one-hundredth of a cubic metre (0.01 m³).
19. The method of any preceding claim, wherein the three-dimensional representation comprises a point cloud.
20. The method of any preceding claim, comprising forming a bitstream that includes the points.
21. A computer program product comprising software code that, when executed on a computer device, causes the computer device to perform the method of any preceding claim.
22. A machine-readable storage medium that includes instructions that, when executed by one or more processors of a machine, cause the machine to perform the method of any of claims 1 to 20.
23. An apparatus for processing a three-dimensional representation of a scene, the apparatus comprising: means for rendering one or more two-dimensional objects based on points of the three-dimensional representation; means for identifying a modification of one or more of the objects; means for identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and means for outputting the modifications and the identified points of the three-dimensional representation.
24. A bitstream comprising one or more of the points modified using the method of any of claims 1 to 20.
25. An apparatus, preferably an encoder, for forming and/or encoding the bitstream of claim 24.
26. An apparatus, preferably a decoder, for receiving and/or decoding the bitstream of claim 24.
27. A system for processing a three-dimensional representation of a scene, the system comprising: a viewer module for: rendering one or more two-dimensional objects based on points of the three-dimensional representation; identifying a modification of one or more of the objects; and re-rendering the two-dimensional objects based on the modification; and an editing module for: identifying one or more points of the three-dimensional representation that are associated with the identified two-dimensional objects; and outputting the modifications and the identified points of the three-dimensional representation.
28. The system of claim 27, comprising a reader module for: determining, for one or more points of the three-dimensional representation, a first set of arbitrary output values (AOVs), the first set of AOVs defining location information for the one or more points, wherein the viewer module is arranged to render the two-dimensional objects based on the first set of AOVs for the points.