WO2025122035A1 - Processing 3d media streams - Google Patents
Processing 3d media streams
- Publication number
- WO2025122035A1 (PCT/SE2023/051217)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- depth
- captured
- frame
- relevant
- depth frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/34—Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/10—Processing, recording or transmission of stereoscopic or multi-view image signals
- H04N13/106—Processing image signals
- H04N13/128—Adjusting depth or disparity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Definitions
- the invention relates to an apparatus for processing 3D media streams, a method of processing 3D media streams, a corresponding computer program, a corresponding computer-readable data carrier, and a corresponding data carrier signal.
- Holographic communication refers to real-time capturing, encoding, transporting, and rendering, of three-dimensional (3D) representations, which are anchored in space, of persons shown as stereoscopic images or as 3D video in eXtended-Reality (XR) headsets, delivering a visual effect which is similar to a hologram and providing an immersive experience for the viewer.
- 3D three-dimensional
- holographic communication involves capturing a 3D representation of a person, using an RGB-D camera or any other type of 3D camera which captures both visual and structural information, encoding the captured 3D representation, streaming the encoded 3D representation as a 3D media stream over one or more communications networks, and rendering the 3D representation to a viewer, using an Augmented-Reality (AR) headset, a stereoscopic display, or the like.
- Holographic communication can, e.g., be used for tele-health, remote working and learning, live entertainment, and other use-cases, to provide immersive representations of participants, with high expectations of the quality of the 3D representations which are displayed to viewers, in particular for facial features and borders or edges.
- Creating a 3D representation of a person using sensor data captured by an RGB-D camera, which provides both RGB frames (i.e., visual information about the captured person) and depth frames (i.e., structural information about the captured person), involves projecting the captured depth frame into 3D space to create a point cloud or a 3D mesh, and subsequently coloring the point cloud or 3D mesh using the captured RGB frame.
- the colored point cloud or 3D mesh can then be rendered as a 3D representation (aka hologram) to a viewer.
- the captured RGB-D data frequently suffers from different types of depth-measurement errors, which result in visible errors in the rendered 3D representation of the captured person.
- a known problem is the lack of smoothness of the boundaries of the point cloud or 3D mesh, in particular the edges of the torso and the head of the captured person, leading to an unnatural 3D representation of the captured person when being rendered to the viewer.
- Known approaches address such issues of the captured RGB-D data prior to projecting the captured depth frame into 3D space to create a point cloud or 3D mesh (see, e.g., [1] “Approximate Depth Shape Reconstruction for RGB-D Images Captured from HMDs for Mixed Reality Applications”, by N. Awano, Journal of Imaging, vol. 6, doi: 10.3390/jimaging6030011, MDPI, 2020).
- RGB color
- relying on color (RGB) frames captured by the RGB-D camera can be unreliable, and in practice often requires adjusting parameters to specific frames or images to achieve acceptable results. This is the case since depth transitions, e.g., from foreground to background, may not necessarily be accompanied by sufficiently large color or luminance differences in corresponding locations in the RGB frame.
- as methods relying on RGB frames often utilize color-edge based segmentation (e.g., using the Canny operator, which is notoriously sensitive to manual parametrization tuning), they are vulnerable to detecting edges which in fact are non-structural.
- processing RGB frames is computationally demanding and may not be possible in real-time communication systems such as 3D video or holographic communication.
- an apparatus for processing 3D media streams is provided.
- the apparatus is operative to receive a depth frame captured by a depth sensor.
- the apparatus is further operative to determine a relevant region for the captured depth frame.
- the relevant region encompasses a silhouette of at least part of a person captured in the depth frame.
- the apparatus is further operative to generate a smoothed silhouette by processing the silhouette of the at least part of a person.
- the smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame.
- the apparatus is further operative to determine a relevant depth range for the captured depth frame.
- the relevant depth range encompasses depth values representing the captured at least part of a person.
- the apparatus is further operative to generate a smoothed depth frame.
- the smoothed depth frame is generated by selectively updating depth values in the captured depth frame based on the relevant depth range and the relevant region.
- a method of processing 3D media streams comprises receiving a depth frame captured by a depth sensor.
- the method further comprises determining a relevant region for the captured depth frame.
- the relevant region encompasses a silhouette of at least part of a person captured in the depth frame.
- the method further comprises generating a smoothed silhouette by processing the silhouette of the at least part of a person.
- the smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame.
- the method further comprises determining a relevant depth range for the captured depth frame.
- the relevant depth range encompasses depth values representing the captured at least part of a person.
- the method further comprises generating a smoothed depth frame.
- the smoothed depth frame is generated by selectively updating depth values in the captured depth frame based on the relevant depth range and the relevant region.
- a computer program comprises instructions which, when the computer program is executed by one or more processors comprised in a computing device, cause the computing device to carry out the method according to an embodiment of the second aspect of the invention.
- a data carrier signal is provided.
- the data carrier signal carries the computer program according to the third aspect of the invention.
- the invention makes use of an understanding that more visually pleasing 3D representations of captured persons, with smooth, natural boundaries, may be obtained by applying silhouette-based depth frame smoothing only to pixels or points with depth values in a depth range which is commensurate with the captured person, and which are located in a region of the depth frame where a captured person can be expected.
- the smoothing primarily affects the head and upper body of the captured person, but not any other elements such as tables, laptops etc.
- Fig. 1 schematically illustrates creating a 3D representation of a person based on captured RGB frames and depth frames.
- Fig. 2 schematically illustrates depth-frame smoothing in accordance with embodiments of the invention.
- Fig. 3 schematically illustrates embodiments of the apparatus for processing 3D media streams.
- Fig. 4 exemplifies a captured depth frame and masks derived therefrom, in accordance with embodiments of the invention.
- Fig. 5 exemplifies silhouette smoothing with different numbers of iterations, in accordance with embodiments of the invention.
- Fig. 6 shows a flow chart illustrating embodiments of the method of processing 3D media streams.
- Fig. 1 schematically illustrates creating a 3D representation of a person based on captured RGB frames (i.e., visual information about the captured person) and depth frames (i.e., structural information about the captured person).
- the captured RGB frames may be received from an RGB sensor 111, also known as RGB camera or color camera.
- the depth frames may be received from a depth sensor 112, also known as range sensor.
- the depth sensor 112 may, e.g., be a stereo camera, based on structured light or on time-of-flight (e.g., LIDAR).
- the RGB sensor 111 and the depth sensor 112 may be integrated into a single unit, known as RGB-D sensor or RGB-D camera 110.
- Known solutions for creating a 3D representation of a person based on captured RGB frames and depth frames involve projecting 120 the captured depth frame which is received from the depth sensor 112 into 3D space to create a point cloud or a 3D mesh, which represents structural information only (in Fig. 1 referred to as “Projection to 3D mesh” to obtain an “Uncolored 3D mesh”). Subsequently, the uncolored 3D mesh is textured, or colored 130 (in Fig. 1 referred to as “Mesh texturing” to obtain a “Colored 3D mesh”) using the RGB frame received from the RGB sensor 111.
- 3D meshes are geometric structures with vertices (points), edges (connecting vertices), and surfaces or faces (usually defined as triangles delimited by three vertices). Coloring or texturing a 3D mesh is achieved by coloring the surfaces or faces of the 3D mesh. Point clouds on the other hand are collections of vertices only, without edges or surfaces/faces. Coloring or texturing a point cloud is achieved by coloring the points (vertices). During rendering, the points or vertices are typically given some non-zero radius to make them visible. Whereas both approaches, 3D meshes or point clouds, have their specific advantages and disadvantages, both can be used in relation to embodiments of the invention which are described throughout this disclosure.
- 3D meshes are also preferred as they better represent captured persons.
- the colored 3D mesh can then be rendered as a 3D representation (hologram) on a display 140, such as a stereoscopic display, an AR headset, or AR glasses, to a viewer.
- a display 140 such as a stereoscopic display, an AR headset, or AR glasses.
- the projecting 120 the captured depth frame into a 3D mesh and coloring 130 the 3D mesh also requires projection parameters which are obtained from the RGB sensor 111 and/or the depth sensor 112.
- the projection parameters typically comprise parameters related to the intrinsic camera matrix, which can be used to map a 3D point to a 2D image plane, and vice-versa (2D to 3D), given a depth frame.
- the projection parameters may, e.g., comprise focal length (both horizontally and vertically) and principal-point coordinates in the image plane.
- embodiments of the invention rely on smoothing 200 depth frames (in Fig. 1 referred to as “Depth frame smoothing” to obtain a “Smoothed depth frame”) which are received from the depth sensor 112 before projecting 120 the smoothed depth frame onto a 3D mesh and coloring/texturing 130 the (uncolored) 3D mesh to obtain a colored 3D mesh for subsequent rendering on the display 140.
- Embodiments of the invention apply silhouette-based depth frame smoothing in a relevant depth range and in a relevant region of the captured depth frame, such that the smoothing primarily affects the head and upper body of a captured person, but preferably not any other elements such as keyboard, computer mouse, etc.
- 3D media streams are understood to encompass streams carrying depth frames and optionally RGB (color) frames.
- the apparatus 300 for processing 3D media streams may comprise one or more processors 301, such as Central Processing Units (CPUs), microprocessors, application processors, application-specific processors, Graphics Processing Units (GPUs), and Digital Signal Processors (DSPs) including image processors, or a combination thereof, and a memory 302 comprising a computer program 303, i.e., software, comprising instructions.
- processors 301 such as Central Processing Units (CPUs), microprocessors, application processors, application-specific processors, Graphics Processing Units (GPUs), and Digital Signal Processors (DSPs) including image processors, or a combination thereof
- a computer program 303 i.e., software
- the memory 302 may, e.g., be a Random-Access Memory (RAM), a Read-Only Memory (ROM), a Flash memory, or the like.
- the computer program 303 may be downloaded to the memory 302 by means of a network interface circuitry 304, as a data carrier signal carrying the computer program 303.
- the apparatus 300 may alternatively or additionally comprise one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), or the like, which are operative to cause the apparatus 300 to be operative in accordance with embodiments of the invention described herein.
- ASICs Application-Specific Integrated Circuits
- FPGAs Field-Programmable Gate Arrays
- the network interface circuitry 304 may comprise one or more of a cellular modem (e.g., GSM, UMTS, LTE, 5G, or higher generation), a WLAN/Wi-Fi modem, a Bluetooth modem, an Ethernet interface, an optical interface, or the like, for exchanging data between the apparatus 300 and one or more other computing devices for providing a 3D representation of a participant captured by an RGB-D sensor 110 for rendering on a display 140 to a viewer, by means of a 3D media stream.
- a cellular modem e.g., GSM, UMTS, LTE, 5G, or higher generation
- the 3D media stream may be carried directly between a sender device (which is connected to or comprises the RGB-D sensor 110) and a receiver device (which is connected to or comprises the display 140), or via one or more communications networks like a Local Area network (LAN), operator networks, or the Internet.
- the apparatus 300 for processing 3D media streams may be comprised in a sender device, which may optionally comprise the RGB sensor 111 and the depth sensor 112, i.e., the RGB-D sensor 110.
- the sender device may, e.g., be a mobile phone, a tablet, a 3D video camera, or the like.
- the apparatus 300 for processing 3D media streams may be comprised in a receiver device, which may optionally comprise the display 140.
- the receiver device may, e.g., be a mobile phone, a tablet, a 3D display, an AR headset, or the like.
- the apparatus 300 for processing 3D media streams may be comprised in an intermediate device in-between the sender device, from which captured RGB frames and depth frames are received, and the receiver device, to which colored 3D frames are transmitted for rendering.
- the apparatus 300 for processing 3D media streams may, e.g., be comprised in an edge server, a cloud server, an application server, or any other type of network node.
- the apparatus 300 for processing 3D media streams may be distributed, i.e., different parts of its functionality may be collaboratively provided by different computing nodes or processors, e.g., in a cloud computing environment.
- the apparatus 300 for processing 3D media streams is operative to receive a depth frame 201 captured by a depth sensor 112.
- a depth frame, also referred to as depth map or depth image, is typically a bitmap, where each point or pixel holds a distance value (between the depth sensor 112 and a captured object, such as a person).
- the depth frame may be encoded as a single-channel (“grayscale”) image, where each grayscale value represents a depth value, or as a color (RGB) image. In the latter case, a transformation between a depth value and the RGB color value is used.
- the received captured depth frame 201 may have been pre-processed to remove depth values which represent a background of the captured scene, e.g., objects which are located behind the captured person.
- This may, e.g., be achieved by removing depth values which exceed a distance threshold.
- the distance threshold may, e.g., be set to a value which is 0.5 to 2 m larger than the distance between the depth sensor 112 and the captured person.
- the apparatus 300 for processing 3D media streams is further operative to determine 212 a relevant region for the captured depth frame 201.
- the relevant region encompasses a silhouette of at least part of a person captured in the depth frame 201.
- a silhouette is understood to be the continuous enclosing outline of at least part of a person captured by the depth frame.
- the apparatus 300 may be operative to determine 212 the relevant region for the captured depth frame 201 by identifying a boundary between a foreground and a background in the captured depth frame 201.
- foreground and background are to be understood with respect to the expected content of the captured depth frame 201, which, for typical use cases of holographic communication, or 3D video, involves one or more persons, typically sitting at a desk or table, with some walls, office furniture, or similar objects behind the person(s) or surrounding them.
- the foreground would be the person(s) and any objects adjacent or in front of them, such as a keyboard or laptop.
- the background, on the other hand, would be the objects behind the person. That is, the boundary between the foreground and the background indicates the boundary around the person and, optionally, any objects directly adjacent to, or in front of, the captured person.
- a missing depth value can be encoded using a reserved value, such as zero, NaN (Not a Number), or the maximum depth value for depth frames captured by the depth sensor 112 (e.g., 255 for 8-bit range, or 65535 for 16-bit range, of depth values).
- the boundary between the foreground and the background may, e.g., be identified by determining a difference between morphological dilation and erosion of the captured depth frame 201.
- a morphological kernel is used which determines how “wide” the boundary is.
- the difference between dilation and erosion is nonzero throughout the whole foreground, as erosion and dilation induce small depth changes throughout the depth frame.
- the difference between dilation and erosion is binarized based on a threshold which is set in relation to the maximum depth value of the captured depth frame 201.
- a suitable value for this threshold is about 10% of the maximum depth value of the captured depth frame 201. That is, if the difference between dilation and erosion for a point or pixel of the depth frame is larger than the threshold value, the corresponding point or pixel in the mask is set to one. Correspondingly, if the difference between dilation and erosion for a point or pixel of the depth frame is equal to or smaller than the threshold value, the corresponding point or pixel in the mask is set to zero.
- the apparatus 300 for processing 3D media streams may further be operative to determine 212 the relevant region for the captured depth frame 201 by excluding a lower part of the captured depth frame 201. For instance, the lower 20% of (the height of) the captured depth frame 201 may be discarded or excluded from processing. This is based on the understanding that in typical video-communication scenarios, including holographic communications, the lower part of captured frames frequently contains objects like a desk, a keyboard, or the like. In Fig. 4, processing of captured depth frames in accordance with embodiments of the invention is exemplified.
- the left image illustrates a depth frame of a captured person
- the center image illustrates the relevant region, in the form of a binary mask, in which white points or pixels represent the boundary between the foreground (inside the boundary) and the background (outside the boundary).
- the relevant region has a thickness of more than one pixel or point, to accommodate for low depth resolution and the lacking precision of many commercial RGB-D sensors to accurately capture edges of objects.
- the apparatus 300 for processing 3D media streams is further operative to generate 213 a smoothed silhouette by processing the silhouette of the at least part of a person.
- the smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame 201.
- the apparatus 300 may be operative to generate 213 the smoothed silhouette by binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame. This results in a re-positioning and smoothing of the boundary, which continues to have a progressive effect with increasing number of iterations. Therefore, the suitable number of iterations of dilating, smoothing, and eroding, the binarized depth frame is not obvious and can differ between different captured depth frames 201.
- the apparatus 300 may be operative to terminate the iteratively dilating, smoothing, and eroding, the binarized depth frame when a fraction of the smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame which extends outside the relevant region exceeds a threshold.
- the iterative dilating, smoothing, and eroding, of the binarized depth frame can be terminated in an adaptive manner. In practice, this can be achieved by detecting the contours in the binarized and smoothed depth frame, identifying the silhouette contour from the set of all detected contours, and calculating how many contour pixels are inside the relevant region and how many are outside it.
- the relevant region may be dilated to broaden the permissible range, and boundary pixels may be added to the relevant region so that the bottom and side edges of the silhouette contour do not immediately trigger a termination of the iterative dilating, smoothing, and eroding, of the binarized depth frame after the first iteration. If the number of contour pixels outside the relevant region exceeds a threshold, the iterative dilating, smoothing, and eroding, of the binarized depth frame is terminated.
- a minimum (e.g., 3 iterations) and/or a maximum (e.g., 20 iterations) number of iterations may be set to ensure a minimum amount of smoothing and to limit processing time, respectively.
- the apparatus 300 for processing 3D media streams may further be operative, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, to erode and blur the binarized depth frame.
- the final erosion reduces a growing effect that the smoothing may induce.
- the final blurring corrects any irregularities stemming from the final erosion step.
- the apparatus 300 for processing 3D media streams may further be operative, before binarizing the depth frame, to remove depth values for pixels or points which are located within the relevant region, and which have depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels. This amounts to removing 211 steep boundary points. Depth values which represent steep depth transitions, and which are outside the relevant region, are kept. This is to ensure that depth values belonging to the captured at least part of a person, in particular in areas like the chin, neck, or nose, are not removed.
- the gradient threshold value may, e.g., be set to 1% of the maximum depth value of the captured depth frame 201. In this way, values representing relatively steep depth transitions are removed from the depth frame.
- the apparatus 300 may be operative to perform a closing operation, i.e., a dilation followed by an erosion, before and/or after removing depth values for pixels or points which are located within the relevant region, and which have depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels. This has the effect of reducing any small holes that may be created as a consequence of removing depth values.
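For illustration only, the following sketch shows how such removal of steep boundary points inside the relevant region, framed by closing operations, could look. OpenCV and NumPy are assumed; the 3x3 neighbourhood, the use of a local max/min comparison, and the function name are the editor's illustrative choices, not taken from the disclosure, while the 1% gradient threshold follows the example value above.

```python
import cv2
import numpy as np

def remove_steep_boundary_points(depth, region_mask, grad_frac=0.01):
    """Remove depth values inside the relevant region that differ too much from a neighbour."""
    kernel = np.ones((3, 3), np.uint8)
    d = depth.astype(np.float32)
    d = cv2.morphologyEx(d, cv2.MORPH_CLOSE, kernel)     # closing before removal (fills small holes)
    thresh = grad_frac * d.max()                         # ~1% of the maximum depth value
    local_max = cv2.dilate(d, kernel)
    local_min = cv2.erode(d, kernel)
    # A pixel is "steep" if it differs by more than the threshold from at least one neighbour.
    steep = ((local_max - d) > thresh) | ((d - local_min) > thresh)
    d[steep & (region_mask > 0)] = 0.0                   # keep steep values outside the relevant region
    return cv2.morphologyEx(d, cv2.MORPH_CLOSE, kernel)  # closing after removal
```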
- the depth frame is re-binarized, the final contours are calculated, and the silhouette contour is identified and filled to produce a resulting smoothed silhouette, which is exemplified in the right image of Fig. 4.
- Fig. 5 exemplifies silhouette smoothing with different numbers of iterations, in accordance with embodiments of the invention.
- the left image shows the original silhouette (corresponding to the right image of Fig. 4)
- the center image shows the silhouette after 100 iterations of dilating, smoothing, and eroding, the binarized depth frame
- the right image shows the silhouette after 1000 iterations of dilating, smoothing, and eroding, the binarized depth frame.
- the apparatus 300 for processing 3D media streams is further operative to determine 214 a relevant depth range for the captured depth frame 201.
- the relevant depth range encompasses depth values which represent the captured at least part of a person.
- the apparatus 300 may be operative to determine 214 the relevant depth range for the captured depth frame 201 by identifying at least one center-of-mass in the depth values in the captured depth frame 201 within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass.
- the relevant depth range is centered around the center-of-mass, if a single center-of-mass is determined. If several centers-of-mass are determined, the relevant depth range may be adapted to each depth value by determining the nearest center-of-mass for that specific depth value. As an alternative, the relevant depth range may be set to encompass the minima and the maxima of all centers-of-mass.
- the apparatus 300 may be operative to determine 214 the relevant depth range for the captured depth frame 201 by setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame 201.
- the relevant depth range may be set to +/-20% of the actual range of depth values in the captured depth frame 201, centered at the center-of-mass.
- the apparatus 300 may be operative to determine the relevant depth range for the captured depth frame 201 by setting the relevant depth range to an absolute distance.
- the relevant depth range may be set to an absolute distance which is commensurate with dimensions of the captured person, or generally a person.
- the captured depth frame 201 represents absolute depth values, or can be converted into absolute depth values (e.g., if the depth range of the depth sensor 112 can be configured or can be retrieved, e.g., through an API of the depth sensor 112)
- the relevant depth range may be set using an absolute distance, such as +/-30 cm centered around the center-of-mass.
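A minimal sketch of determining the relevant depth range from a single centre-of-mass is given below, assuming NumPy, valid depth values inside the silhouette, and the +/-20% example above; handling several centres-of-mass, or switching to an absolute distance such as +/-30 cm, is omitted, and the function name is illustrative.

```python
import numpy as np

def relevant_depth_range(depth, silhouette_mask, frac=0.20):
    """Centre the range on the mean depth inside the silhouette, +/-20% of the actual range."""
    vals = depth[(silhouette_mask > 0) & (depth > 0)].astype(np.float32)
    center = vals.mean()                             # single centre-of-mass of the depth values
    half_width = frac * (vals.max() - vals.min())    # +/-20% of the actual depth range
    return center - half_width, center + half_width
```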
- the apparatus 300 for processing 3D media streams is further operative to generate a smoothed depth frame 202 by selectively updating 215 depth values in the captured depth frame 201 based on the relevant depth range and the relevant region.
- the apparatus 300 may be operative to selectively update 215 depth values in the captured depth frame 201 based on the relevant depth range and the relevant region by one or more of the following operations:
- removing depth values is achieved by setting the value in the depth frame to a value which indicates "no depth value". This may, e.g., be zero or any other reserved value, such as NaN or a maximum depth value.
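As a sketch of the selective update, the snippet below applies one of the operations listed further below (removing depth values for pixels inside the relevant region, outside the smoothed silhouette, and within the relevant depth range); zero is used as the reserved "no depth value", NumPy is assumed, and the function and argument names are illustrative rather than taken from the disclosure.

```python
import numpy as np

def selectively_update(depth, region_mask, smoothed_silhouette, depth_range):
    """Remove depth values in the relevant region but outside the smoothed silhouette,
    provided they fall within the relevant depth range."""
    low, high = depth_range
    out = depth.copy()
    remove = (region_mask > 0) & (smoothed_silhouette == 0) & (out >= low) & (out <= high)
    out[remove] = 0                                  # 0 is the reserved "no depth value"
    return out
```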
- Embodiments of the invention are advantageous in that the obtained smoothed depth frames lead to more visually pleasing 3D representations of captured persons with smooth, natural boundaries in the relevant regions of the depth frames, i.e., the regions where parts of captured persons are located.
- the smoothing primarily affects the head and upper body of the captured person, and not any other elements such as tables, laptops etc.
- embodiments of the invention result in 3D meshes which have fewer errors and smoother boundaries, leading to more realistic 3D representations.
- embodiments of the invention are computationally not very demanding and can be applied in real time.
- the depth-frame processing described herein, in particular the depth-frame smoothing, does not rely on other input data like RGB frames which need to be aligned with the captured depth frame, thereby eliminating a common source of errors in known solutions.
- the described solution is applicable to depth frames captured by a wide range of different RGB-D cameras.
- the apparatus 300 for processing 3D media streams may further be operative to receive an RGB frame captured by an RGB sensor 111, create an uncolored 3D mesh by projecting 120 the captured depth frame 201 into 3D space, and color 130 the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh.
- the apparatus 300 may further be operative to render the colored 3D mesh using a display 140.
- the method 600 may, e.g., be performed by a computing device, such as the apparatus 300 for processing 3D media streams.
- the method 600 comprises receiving 601 a depth frame captured by a depth sensor.
- the method 600 further comprises determining 603 a relevant region for the captured depth frame.
- the relevant region encompasses a silhouette of at least part of a person captured in the depth frame.
- the method 600 further comprises generating 604 a smoothed silhouette by processing the silhouette of the at least part of a person.
- the smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame.
- the method 600 further comprises determining 605 a relevant depth range for the captured depth frame.
- the relevant depth range encompasses depth values representing the captured at least part of a person.
- the method 600 further comprises generating 606 a smoothed depth frame based on the relevant depth range and the relevant region.
- the smoothed depth frame is generated by selectively updating depth values in the captured depth frame.
- the selectively updating 606 depth values in the captured depth frame based on the relevant depth range and the relevant region may comprise one or more of: - Removing depth values for pixels located within the relevant region, but outside the smoothed silhouette, and with depth values within the relevant depth range,
- the determining 605 a relevant depth range for the captured depth frame may comprise identifying at least one center-of-mass in the depth values in the captured depth frame within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass.
- the determining 605 the relevant depth range for the captured depth frame may comprise setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame.
- the determining 605 the relevant depth range for the captured depth frame may comprise setting the relevant depth range to an absolute distance.
- the determining 603 a relevant region for the captured depth frame may comprise identifying a boundary between a foreground and a background in the captured depth frame.
- the identifying the boundary between a foreground and a background in the captured depth frame may comprise determining a difference between morphological dilation and erosion of the captured depth frame.
- the determining 603 a relevant region for the captured depth frame may comprise excluding a lower part of the depth frame.
- the generating 606 the smoothed silhouette may comprise binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame.
- the iteratively dilating, smoothing, and eroding, the binarized depth frame is terminated when a fraction of the smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame extending outside the relevant region exceeds a threshold.
- the method 600 may further comprise, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, eroding and blurring the binarized depth frame.
- the method 600 may further comprise, before binarizing the depth frame, removing 602 depth values for pixels located within the relevant region and with depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels.
- the method 600 may further comprise receiving 607 an RGB frame captured by an RGB sensor, creating an uncolored 3D mesh by projecting 608 the captured depth frame into 3D space, and coloring 609 the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh.
- the method 600 may further comprise rendering 610 the colored 3D mesh using a 3D display.
- An embodiment of the method 600 may be implemented as the computer program 303 comprising instructions which, when the computer program 303 is executed by one or more processor(s) 301 comprised in the computing device, such as the apparatus 300 for processing 3D media streams, cause the computing device to carry out the method 600 and become operative in accordance with embodiments of the invention described herein.
- the computer program 303 may be stored in a computer-readable data carrier, such as the memory 302.
- the computer program 303 may be carried by a data carrier signal, e.g., downloaded to the memory 302 via the network interface circuitry 304.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Processing Or Creating Images (AREA)
Abstract
An apparatus for processing 3D media streams is provided. The apparatus is operative to receive a depth frame (201) captured by a depth sensor (112), determine (212) a relevant region for the captured depth frame, the relevant region encompassing a silhouette of at least part of a person captured in the depth frame, generate (213) a smoothed silhouette by processing the silhouette of the at least part of a person, the smoothed silhouette having smoother borders than the silhouette comprised in the captured depth frame, determine (214) a relevant depth range for the captured depth frame, the relevant depth range encompassing depth values representing the captured at least part of a person, and generate a smoothed depth frame (202) by selectively updating (215) depth values in the captured depth frame based on the relevant depth range and the relevant region.
Description
PROCESSING 3D MEDIA STREAMS
Technical field
The invention relates to an apparatus for processing 3D media streams, a method of processing 3D media streams, a corresponding computer program, a corresponding computer-readable data carrier, and a corresponding data carrier signal.
Background
Holographic communication refers to real-time capturing, encoding, transporting, and rendering, of three-dimensional (3D) representations, which are anchored in space, of persons shown as stereoscopic images or as 3D video in eXtended-Reality (XR) headsets, delivering a visual effect which is similar to a hologram and providing an immersive experience for the viewer.
Typically, holographic communication involves capturing a 3D representation of a person, using an RGB-D camera or any other type of 3D camera which captures both visual and structural information, encoding the captured 3D representation, streaming the encoded 3D representation as a 3D media stream over one or more communications networks, and rendering the 3D representation to a viewer, using an Augmented-Reality (AR) headset, a stereoscopic display, or the like. Holographic communication can, e.g., be used for tele-health, remote working and learning, live entertainment, and other use-cases, to provide immersive representations of participants, with high expectations of the quality of the 3D representations which are displayed to viewers, in particular for facial features and borders or edges.
Creating a 3D representation of a person using sensor data captured by an RGB-D camera, which provides both RGB frames (i.e., visual information about the captured person) and depth frames (i.e., structural information
about the captured person), involves projecting the captured depth frame into 3D space to create a point cloud or a 3D mesh, and subsequently coloring the point cloud or 3D mesh using the captured RGB frame. The colored point cloud or 3D mesh can then be rendered as a 3D representation (aka hologram) to a viewer.
The captured RGB-D data frequently suffers from different types of depth-measurement errors, which result in visible errors in the rendered 3D representation of the captured person. For example, a known problem is the lack of smoothness of the boundaries of the point cloud or 3D mesh, in particular the edges of the torso and the head of the captured person, leading to an unnatural 3D representation of the captured person when being rendered to the viewer. Known approaches address such issues of the captured RGB-D data prior to projecting the captured depth frame into 3D space to create a point cloud or 3D mesh (see, e.g., [1] “Approximate Depth Shape Reconstruction for RGB-D Images Captured from HMDs for Mixed Reality Applications”, by N. Awano, Journal of Imaging, vol. 6, doi: 10.3390/jimaging6030011, MDPI, 2020; [2] “Depth Map Super-Resolution for Cost-Effective RGB-D Camera”, by R. Takaoka and N. Hashimoto, 2015 International Conference on Cyberworlds (CW), doi: 10.1109/CW.2015.32, IEEE, 2015; [3] “Image Guided Depth Super-Resolution for Spacewarp in XR Applications”, by C. Peri and Y. Xiong, 2021 IEEE International Conference on Consumer Electronics (ICCE), doi: 10.1109/ICCE30685.2021.9427716, IEEE, 2021).
However, outstanding issues remain with those techniques. In general, relying on color (RGB) frames captured by the RGB-D camera can be unreliable, and in practice often requires adjusting parameters to specific frames or images to achieve acceptable results. This is the case since depth transitions, e.g., from foreground to background, may not necessarily be accompanied by sufficiently large color or luminance differences in corresponding locations in the RGB frame. Moreover, as methods relying on
RGB frames often utilize color-edge based segmentation (e.g., using the Canny operator, which is notoriously sensitive to manual parametrization tuning), they are vulnerable to detecting edges which in fact are non- structural. Furthermore, processing RGB frames is computationally demanding and may not be possible in real-time communication systems such as 3D video or holographic communication.
Another issue arises in relation to smoothing the borders of silhouettes of a person captured by an RGB-D sensor and presented as a 3D representation, as the solutions referred to above (see [1]-[3]) do not focus on the subjective smoothness of borders. As a result, such methods may worsen the rendered 3D representation. Further, the method presented in [3] significantly reduces the resolution of the captured depth frames and results in low-resolution point clouds or 3D meshes, and thus a rendered 3D representation with low resolution.
Summary
It is an object of the invention to provide an improved alternative to the above techniques and prior art.
More specifically, it is an object of the invention to provide improved solutions for processing 3D media streams. In particular, it is an object of the invention to provide an improved smoothing of depth frames capturing a person.
These and other objects of the invention are achieved by means of different aspects of the invention, as defined by the independent claims. Embodiments of the invention are characterized by the dependent claims.
According to a first aspect of the invention, an apparatus for processing 3D media streams is provided. The apparatus is operative to receive a depth frame captured by a depth sensor. The apparatus is further operative to determine a relevant region for the captured depth frame. The relevant region
encompasses a silhouette of at least part of a person captured in the depth frame. The apparatus is further operative to generate a smoothed silhouette by processing the silhouette of the at least part of a person. The smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame. The apparatus is further operative to determine a relevant depth range for the captured depth frame. The relevant depth range encompasses depth values representing the captured at least part of a person. The apparatus is further operative to generate a smoothed depth frame. The smoothed depth frame is generated by selectively updating depth values in the captured depth frame based on the relevant depth range and the relevant region.
According to a second aspect of the invention, a method of processing 3D media streams is provided. The method comprises receiving a depth frame captured by a depth sensor. The method further comprises determining a relevant region for the captured depth frame. The relevant region encompasses a silhouette of at least part of a person captured in the depth frame. The method further comprises generating a smoothed silhouette by processing the silhouette of the at least part of a person. The smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame. The method further comprises determining a relevant depth range for the captured depth frame. The relevant depth range encompasses depth values representing the captured at least part of a person. The method further comprises generating a smoothed depth frame. The smoothed depth frame is generated by selectively updating depth values in the captured depth frame based on the relevant depth range and the relevant region.
According to a third aspect of the invention, a computer program is provided. The computer program comprises instructions which, when the computer program is executed by one or more processors comprised in a
computing device, cause the computing device to carry out the method according to an embodiment of the second aspect of the invention.
According to a fourth aspect of the invention, a computer-readable data carrier is provided. The computer-readable data carrier has stored thereon the computer program according to the third aspect of the invention.
According to a fifth aspect of the invention, a data carrier signal is provided. The data carrier signal carries the computer program according to the third aspect of the invention.
The invention makes use of an understanding that more visually pleasing 3D representations of captured persons, with smooth, natural boundaries, may be obtained by applying silhouette-based depth frame smoothing only to pixels or points with depth values in a depth range which is commensurate with the captured person, and which are located in a region of the depth frame where a captured person can be expected. As a result, the smoothing primarily affects the head and upper body of the captured person, but not any other elements such as tables, laptops etc.
Even though advantages of the invention have in some cases been described with reference to embodiments of the first aspect of the invention, corresponding reasoning applies to embodiments of other aspects of the invention.
Further objectives of, features of, and advantages with, the invention will become apparent when studying the following detailed disclosure, the drawings, and the appended claims. Those skilled in the art realize that different features of the invention can be combined to create embodiments other than those described in the following.
Brief description of the drawings
The above, as well as additional objects, features and advantages of the invention, will be better understood through the following illustrative and
non-limiting detailed description of embodiments of the invention, with reference to the appended drawings, in which:
Fig. 1 schematically illustrates creating a 3D representation of a person based on captured RGB frames and depth frames.
Fig. 2 schematically illustrates depth-frame smoothing in accordance with embodiments of the invention.
Fig. 3 schematically illustrates embodiments of the apparatus for processing 3D media streams.
Fig. 4 exemplifies a captured depth frame and masks derived therefrom, in accordance with embodiments of the invention.
Fig. 5 exemplifies silhouette smoothing with different numbers of iterations, in accordance with embodiments of the invention.
Fig. 6 shows a flow chart illustrating embodiments of the method of processing 3D media streams.
All the figures are schematic, not necessarily to scale, and generally only show parts which are necessary in order to elucidate the invention, wherein other parts may be omitted or merely suggested.
Detailed description
The invention will now be described more fully herein after with reference to the accompanying drawings, in which certain embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided by way of example so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Embodiments of the invention are described with reference to Fig. 1, which schematically illustrates creating a 3D representation of a person based on captured RGB frames (i.e., visual information about the captured person) and depth frames (i.e., structural information about the captured person). The captured RGB frames may be received from an RGB sensor 111, also known as RGB camera or color camera. The depth frames may be received from a depth sensor 112, also known as range sensor. The depth sensor 112 may, e.g., be a stereo camera, based on structured light or on time-of-flight (e.g., LIDAR). Optionally, the RGB sensor 111 and the depth sensor 112 may be integrated into a single unit, known as RGB-D sensor or RGB-D camera 110.
Known solutions for creating a 3D representation of a person based on captured RGB frames and depth frames involve projecting 120 the captured depth frame which is received from the depth sensor 112 into 3D space to create a point cloud or a 3D mesh, which represents structural information only (in Fig. 1 referred to as “Projection to 3D mesh” to obtain an “Uncolored 3D mesh”). Subsequently, the uncolored 3D mesh is textured, or colored 130 (in Fig. 1 referred to as “Mesh texturing” to obtain a “Colored 3D mesh”) using the RGB frame received from the RGB sensor 111.
3D meshes are geometric structures with vertices (points), edges (connecting vertices), and surfaces or faces (usually defined as triangles delimited by three vertices). Coloring or texturing a 3D mesh is achieved by coloring the surfaces or faces of the 3D mesh. Point clouds, on the other hand, are collections of vertices only, without edges or surfaces/faces. Coloring or texturing a point cloud is achieved by coloring the points (vertices). During rendering, the points or vertices are typically given some non-zero radius to make them visible. Whereas both approaches, 3D meshes or point clouds, have their specific advantages and disadvantages, both can be used in relation to embodiments of the invention which are described throughout this disclosure. However, for the sake of simplifying the disclosure, embodiments of the invention are foremost described in relation to 3D meshes, which are also preferred as they better represent captured persons.
The colored 3D mesh can then be rendered as a 3D representation (hologram) on a display 140, such as a stereoscopic display, an AR headset, or AR glasses, to a viewer. In addition to the RGB frames and depth frames captured by the RGB sensor 111 and depth sensor 112, respectively, the projecting 120 the captured depth frame into a 3D mesh and coloring 130 the 3D mesh also requires projection parameters which are obtained from the RGB sensor 111 and/or the depth sensor 112. The projection parameters typically comprise parameters related to the intrinsic camera matrix, which can be used to map a 3D point to a 2D image plane, and vice-versa (2D to 3D), given a depth frame. The projection parameters may, e.g., comprise focal length (both horizontally and vertically) and principal-point coordinates in the image plane.
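To illustrate the 2D-to-3D mapping enabled by these projection parameters, the following minimal sketch back-projects a depth frame into a set of 3D points using the standard pinhole model. It assumes NumPy, a metric depth frame, and intrinsic parameters fx, fy (focal lengths) and cx, cy (principal point) obtained from the sensor; the function name and the handling of missing depth values are illustrative and not taken from the disclosure.

```python
import numpy as np

def back_project(depth_m, fx, fy, cx, cy):
    """Map every pixel (u, v) with depth z to a 3D point (x, y, z) using the pinhole model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack((x, y, z), axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # drop pixels without a depth value
```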
In addition to what is described hereinbefore for known solutions for creating a 3D representation of a person, embodiments of the invention rely on smoothing 200 depth frames (in Fig. 1 referred to as “Depth frame smoothing” to obtain a “Smoothed depth frame”) which are received from the depth sensor 112 before projecting 120 the smoothed depth frame onto a 3D mesh and coloring/texturing 130 the (uncolored) 3D mesh to obtain a colored 3D mesh for subsequent rendering on the display 140.
Embodiments of the invention apply silhouette-based depth frame smoothing in a relevant depth range and in a relevant region of the captured depth frame, such that the smoothing primarily affects the head and upper body of a captured person, but preferably not any other elements such as keyboard, computer mouse, etc.
In the following, embodiments of the apparatus 300 for processing 3D media streams are described with reference to Fig. 2, which schematically illustrates depth-frame smoothing in accordance with embodiments of the invention, and Fig. 3. In the present context, 3D media streams are understood to encompass streams carrying depth frames and optionally RGB (color) frames.
The apparatus 300 for processing 3D media streams (herein also referred to as the “apparatus”) may comprise one or more processors 301, such as Central Processing Units (CPUs), microprocessors, application processors, application-specific processors, Graphics Processing Units (GPUs), and Digital Signal Processors (DSPs) including image processors, or a combination thereof, and a memory 302 comprising a computer program 303, i.e., software, comprising instructions. When executed by the processor(s) 301, the instructions cause the apparatus 300 to be operative in accordance with embodiments of the invention described herein. The memory 302 may, e.g., be a Random-Access Memory (RAM), a Read-Only Memory (ROM), a Flash memory, or the like. The computer program 303 may be downloaded to the memory 302 by means of a network interface circuitry 304, as a data carrier signal carrying the computer program 303. The apparatus 300 may alternatively or additionally comprise one or more Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), or the like, which are operative to cause the apparatus 300 to be operative in accordance with embodiments of the invention described herein.
The network interface circuitry 304 may comprise one or more of a cellular modem (e.g., GSM, UMTS, LTE, 5G, or higher generation), a WLAN/Wi-Fi modem, a Bluetooth modem, an Ethernet interface, an optical interface, or the like, for exchanging data between the apparatus 300 and one or more other computing devices for providing a 3D representation of a participant captured by an RGB-D sensor 110 for rendering on a display 140 to a viewer, by means of a 3D media stream. The 3D media stream may be carried directly between a sender device (which is connected to or comprises the RGB-D sensor 110) and a receiver device (which is connected to or comprises the display 140), or via one or more communications networks like a Local Area network (LAN), operator networks, or the Internet.
The apparatus 300 for processing 3D media streams may be comprised in a sender device, which may optionally comprise the RGB sensor 111 and the depth sensor 112, i.e., the RGB-D sensor 110. The sender device may, e.g., be a mobile phone, a tablet, a 3D video camera, or the like. Alternatively, the apparatus 300 for processing 3D media streams may be comprised in a receiver device, which may optionally comprise the display 140. The receiver device may, e.g., be a mobile phone, a tablet, a 3D display, an AR headset, or the like. As yet a further alternative, the apparatus 300 for processing 3D media streams may be comprised in an intermediate device in-between the sender device, from which captured RGB frames and depth frames are received, and the receiver device, to which colored 3D frames are transmitted for rendering. In this case, the apparatus 300 for processing 3D media streams may, e.g., be comprised in an edge server, a cloud server, an application server, or any other type of network node. It will also be appreciated that the apparatus 300 for processing 3D media streams may be distributed, i.e., different parts of its functionality may be collaboratively provided by different computing nodes or processors, e.g., in a cloud computing environment.
More specifically, the apparatus 300 for processing 3D media streams is operative to receive a depth frame 201 captured by a depth sensor 112. A depth frame, also referred to as depth map or depth image, is typically a bitmap, where each point or pixel holds a distance value (between the depth sensor 112 and a captured object, such as a person). The depth frame may be encoded as a single-channel (“grayscale”) image, where each grayscale value represents a depth value, or as a color (RGB) image. In the latter case, a transformation between a depth value and the RGB color value is used. Optionally, the received captured depth frame 201 may have been pre-processed to remove depth values which represent a background of the captured scene, e.g., objects which are located behind the captured person. This may, e.g., be achieved by removing depth values which exceed a
distance threshold. Depending on the distance between the depth sensor 112 and the captured person, the distance threshold may, e.g., be set to a value which is 0.5 to 2 m larger than the distance between the depth sensor 112 and the captured person.
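A minimal sketch of this optional pre-processing step is shown below, assuming a metric depth frame in a NumPy array and a known (or estimated) sensor-to-person distance; the 1 m margin is simply one value in the 0.5 to 2 m range mentioned above, and the function name is illustrative.

```python
import numpy as np

def remove_background(depth_m, person_distance_m, margin_m=1.0):
    """Zero out depth values farther away than the captured person plus a margin."""
    out = depth_m.copy()
    out[out > person_distance_m + margin_m] = 0.0    # 0 encodes "no depth value"
    return out
```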
The apparatus 300 for processing 3D media streams is further operative to determine 212 a relevant region for the captured depth frame 201. The relevant region encompasses a silhouette of at least part of a person captured in the depth frame 201. In the present context, a silhouette is understood to be the continuous enclosing outline of at least part of a person captured by the depth frame.
The apparatus 300 may be operative to determine 212 the relevant region for the captured depth frame 201 by identifying a boundary between a foreground and a background in the captured depth frame 201. In the present context, foreground and background are to be understood with respect to the expected content of the captured depth frame 201, which, for typical use cases of holographic communication, or 3D video, involves one or more persons, typically sitting at a desk or table, with some walls, office furniture, or similar objects behind the person(s) or surrounding them. In such case, the foreground would be the person(s) and any objects adjacent or in front of them, such as a keyboard or laptop. The background, on the other hand, would be the objects behind the person. That is, the boundary between the foreground and the background indicates the boundary around the person and, optionally, any objects directly adjacent to, or in front of, the captured person.
If the background is already removed from the captured depth frame 201, e.g., by removing depth values which exceed a distance threshold, as is described above, the depth values in the captured depth frame 201 only represent the foreground, whereas points or pixels in the depth frame with missing depth values represent the background. In a depth frame represented by a bitmap, a missing depth value can be encoded using
a reserved value, such as zero, NaN (Not a Number), or the maximum depth value for depth frames captured by the depth sensor 112 (e.g., 255 for 8-bit range, or 65535 for 16-bit range, of depth values).
If depth values representing the background in the captured depth frame 201 are set to zero, the boundary between the foreground and the background may, e.g., be identified by determining a difference between morphological dilation and erosion of the captured depth frame 201. For both dilation and erosion, a morphological kernel is used which determines how “wide” the boundary is. The difference between dilation and erosion is nonzero throughout the whole foreground, as erosion and dilation induce small depth changes throughout the depth frame. In practice, to obtain a useful mask (a frame with binary values, i.e., zero or one), the difference between dilation and erosion is binarized based on a threshold which is set in relation to the maximum depth value of the captured depth frame 201. A suitable value for this threshold is about 10% of the maximum depth value of the captured depth frame 201. That is, if the difference between dilation and erosion for a point or pixel of the depth frame is larger than the threshold value, the corresponding point or pixel in the mask is set to one. Correspondingly, if the difference between dilation and erosion for a point or pixel of the depth frame is equal to or smaller than the threshold value, the corresponding point or pixel in the mask is set to zero.
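A minimal sketch of this boundary identification, assuming Python with OpenCV and NumPy; the kernel shape and size are illustrative assumptions, while the 10% threshold follows the value suggested above.

```python
import cv2
import numpy as np

def relevant_region_mask(depth: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Binary mask (values 0/1) marking the boundary between foreground and background."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    dilated = cv2.dilate(depth, kernel)
    eroded = cv2.erode(depth, kernel)
    gradient = cv2.absdiff(dilated, eroded)          # difference between dilation and erosion
    threshold = 0.1 * float(depth.max())             # about 10% of the maximum depth value
    return (gradient > threshold).astype(np.uint8)   # 1 = boundary (relevant region), 0 = elsewhere
```

In this sketch, the kernel size plays the role of the morphological kernel discussed above: a larger kernel yields a wider boundary, i.e., a thicker relevant region.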
The apparatus 300 for processing 3D media streams may further be operative to determine 212 the relevant region for the captured depth frame 201 by excluding a lower part of the captured depth frame 201. For instance, the lower 20% of (the height of) the captured depth frame 201 may be discarded or excluded from processing. This is based on the understanding that in typical video-communication scenarios, including holographic communications, the lower part of captured frames frequently contains objects like a desk, a keyboard, or the like.
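A possible way to exclude the lower part of the relevant region, again only as an illustrative sketch; the 20% fraction follows the example above.

```python
import numpy as np

def exclude_lower_part(mask: np.ndarray, fraction: float = 0.2) -> np.ndarray:
    """Zero out the lower part of the relevant-region mask (desk, keyboard, etc.)."""
    out = mask.copy()
    cutoff = int(round(out.shape[0] * (1.0 - fraction)))
    out[cutoff:, :] = 0
    return out
```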
In Fig. 3, processing of captured depth frames in accordance with embodiments of the invention is exemplified. The left image illustrates a depth frame of a captured person, whereas the center image illustrates the relevant region, in the form of a binary mask, in which white points or pixels represent the boundary between the foreground (inside the boundary) and the background (outside the boundary). Preferably, the relevant region has a thickness of more than one pixel or point, to accommodate the low depth resolution and the limited precision with which many commercial RGB-D sensors capture edges of objects.
The apparatus 300 for processing 3D media streams is further operative to generate 213 a smoothed silhouette by processing the silhouette of the at least part of a person. The smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame 201. For instance, the apparatus 300 may be operative to generate 213 the smoothed silhouette by binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame. This results in a re-positioning and smoothing of the boundary, an effect which becomes progressively stronger with an increasing number of iterations. Therefore, the suitable number of iterations of dilating, smoothing, and eroding, the binarized depth frame is not obvious and can differ between different captured depth frames 201.
The apparatus 300 may be operative to terminate the iteratively dilating, smoothing, and eroding, the binarized depth frame when a fraction of the smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame which extends outside the relevant region exceeds a threshold. In this way, the iterative dilating, smoothing, and eroding, of the binarized depth frame can be terminated in an adaptive manner. In practice, this can be achieved by detecting the contours in the binarized and smoothed depth frame, identifying the silhouette contour from the set of all detected contours, and calculating how many contour pixels are inside the relevant region and how many are outside it. For this purpose, the relevant region may be dilated
to broaden the permissible range, and boundary pixels may be added to the relevant region so that the bottom and side edges of the silhouette contour do not immediately trigger a termination of the iterative dilating, smoothing, and eroding, of the binarized depth frame after the first iteration. If the number of contour pixels outside the relevant region exceeds a threshold, the iterative dilating, smoothing, and eroding, of the binarized depth frame is terminated. In practice, a minimum (e.g., 3 iterations) and/or a maximum (e.g., 20 iterations) number of iterations may be set to ensure a minimum amount of smoothing and to limit processing time, respectively.
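The adaptive iteration described above might, purely as an illustrative sketch in Python with OpenCV and NumPy, look as follows; the kernel sizes, blur parameters, and the termination fraction are assumptions of the sketch, while the minimum and maximum iteration counts follow the examples given above.

```python
import cv2
import numpy as np

def smooth_silhouette(binary: np.ndarray, relevant_region: np.ndarray,
                      min_iter: int = 3, max_iter: int = 20,
                      outside_fraction: float = 0.02) -> np.ndarray:
    """Iteratively dilate, smooth, and erode a binarized depth frame (uint8, 0/255).

    Terminates adaptively when too large a fraction of the silhouette contour
    drifts outside the (dilated) relevant region (uint8, 0/1).
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    # Broaden the permissible range, and add the frame border to the relevant
    # region, so that the bottom and side edges of the silhouette contour do not
    # trigger termination already after the first iteration.
    allowed = cv2.dilate(relevant_region, kernel, iterations=3)
    allowed[[0, -1], :] = 1
    allowed[:, [0, -1]] = 1

    current = binary.copy()
    for i in range(max_iter):
        current = cv2.dilate(current, kernel)
        current = cv2.GaussianBlur(current, (9, 9), 0)             # smoothing step
        _, current = cv2.threshold(current, 127, 255, cv2.THRESH_BINARY)
        current = cv2.erode(current, kernel)

        contours, _ = cv2.findContours(current, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_NONE)
        if not contours:
            break
        silhouette = max(contours, key=cv2.contourArea)            # assume the largest contour is the person
        pts = silhouette.reshape(-1, 2)                            # contour pixels as (x, y)
        outside = np.count_nonzero(allowed[pts[:, 1], pts[:, 0]] == 0)
        if i + 1 >= min_iter and outside / len(pts) > outside_fraction:
            break
    return current
```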
The apparatus 300 for processing 3D media streams may further be operative, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, to erode and blur the binarized depth frame. Advantageously, the final erosion reduces a growing effect that the smoothing may induce. The final blurring corrects any irregularities stemming from the final erosion step.
The apparatus 300 for processing 3D media streams may further be operative, before binarizing the depth frame, to remove depth values for pixels or points which are located within the relevant region, and which have depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels. This amounts to removing 211 steep boundary points. Depth values which represent steep depth transitions, and which are outside the relevant region, are kept. This is to ensure that depth values belonging to the captured at least part of a person, in particular in areas like the chin, neck, or nose, are not removed. The gradient threshold value may, e.g., be set to 1% of the maximum depth value of the captured depth frame 201. In this way, values representing relatively steep depth transitions are removed from the depth frame. Such depth values typically appear in depth frames from some commercial RGB-D cameras and can severely distort the resulting 3D projection by introducing artifacts such as singular points floating in space between adjacent surfaces.
These may be caused by the low depth resolution of RGB-D cameras, which results in smooth slopes instead of hard transitions between adjacent surfaces. As an optional step, the apparatus 300 may be operative to perform a closing operation, i.e., a dilation followed by an erosion, before and/or after removing depth values for pixels or points which are located within the relevant region, and which have depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels. This has the effect of reducing any small holes that may be created as a consequence of removing depth values.
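An illustrative sketch of the removal of steep boundary points, assuming Python with NumPy and OpenCV; ignoring differences to pixels with missing (zero) depth values is an assumption of the sketch, and the 1% gradient threshold follows the example above.

```python
import cv2
import numpy as np

def remove_steep_boundary_points(depth: np.ndarray, relevant_region: np.ndarray,
                                 gradient_fraction: float = 0.01) -> np.ndarray:
    """Remove depth values inside the relevant region that differ from a valid
    neighbouring depth value by more than ~1% of the maximum depth value."""
    d = depth.astype(np.float32)
    valid = depth > 0
    threshold = gradient_fraction * float(d.max())

    # Largest absolute difference to a 4-connected neighbour, counting only
    # neighbours that hold a valid (non-zero) depth value.
    dv = np.abs(d[1:, :] - d[:-1, :])
    vv = valid[1:, :] & valid[:-1, :]
    dh = np.abs(d[:, 1:] - d[:, :-1])
    vh = valid[:, 1:] & valid[:, :-1]
    diff = np.zeros_like(d)
    diff[1:, :] = np.maximum(diff[1:, :], np.where(vv, dv, 0))
    diff[:-1, :] = np.maximum(diff[:-1, :], np.where(vv, dv, 0))
    diff[:, 1:] = np.maximum(diff[:, 1:], np.where(vh, dh, 0))
    diff[:, :-1] = np.maximum(diff[:, :-1], np.where(vh, dh, 0))

    cleaned = depth.copy()
    cleaned[(diff > threshold) & (relevant_region > 0) & valid] = 0

    # Optional closing operation (dilation followed by erosion) to reduce small holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    return cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
```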
Subsequent to the processing to smoothen the depth frame, the depth frame is re-binarized, the final contours are calculated, and the silhouette contour is identified and filled to produce a resulting smoothed silhouette, which is exemplified in the right image of Fig. 3.
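A sketch of this final step, under the same assumptions as above, where the silhouette contour is taken to be the largest external contour of the re-binarized frame:

```python
import cv2
import numpy as np

def fill_silhouette(smoothed_binary: np.ndarray) -> np.ndarray:
    """Identify the silhouette contour (assumed to be the largest one) and fill it."""
    contours, _ = cv2.findContours(smoothed_binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    silhouette = max(contours, key=cv2.contourArea)
    filled = np.zeros_like(smoothed_binary)
    cv2.drawContours(filled, [silhouette], -1, 255, thickness=cv2.FILLED)
    return filled
```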
Fig. 4 exemplifies silhouette smoothing with different numbers of iterations, in accordance with embodiments of the invention. The left image shows the original silhouette (corresponding to the right image of Fig. 3), the center image shows the silhouette after 100 iterations of dilating, smoothing, and eroding, the binarized depth frame, and the right image shows the silhouette after 1000 iterations of dilating, smoothing, and eroding, the binarized depth frame.
The apparatus 300 for processing 3D media streams is further operative to determine 214 a relevant depth range for the captured depth frame 201. The relevant depth range encompasses depth values which represent the captured at least part of a person.
The apparatus 300 may be operative to determine 214 the relevant depth range for the captured depth frame 201 by identifying at least one center-of-mass in the depth values in the captured depth frame 201 within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass. Preferably, the relevant depth range is centered around the center-of-mass, if a single center-of-mass is determined. If several centers-of-mass are determined, the relevant depth range may be adapted to each depth value by determining the nearest center-of-mass for that specific depth value. As an alternative, the relevant depth range may be set to encompass the minima and the maxima of all centers-of-mass.
The apparatus 300 may be operative to determine 214 the relevant depth range for the captured depth frame 201 by setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame 201. For example, the relevant depth range may be set to +/-20% of the actual range of depth values in the captured depth frame 201, centered at the center-of-mass.
Alternatively, the apparatus 300 may be operative to determine the relevant depth range for the captured depth frame 201 by setting the relevant depth range to an absolute distance. In particular, the relevant depth range may be set to an absolute distance which is commensurate with the dimensions of the captured person, or of a person in general. In practice, if the captured depth frame 201 represents absolute depth values, or can be converted into absolute depth values (e.g., if the depth range of the depth sensor 112 can be configured or can be retrieved, e.g., through an API of the depth sensor 112), the relevant depth range may be set using an absolute distance, such as +/-30 cm centered around the center-of-mass.
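A sketch of determining the relevant depth range for the single center-of-mass case, assuming that the center-of-mass is computed as the mean of the depth values within the silhouette; the +/-20% fraction follows the example above.

```python
import numpy as np

def relevant_depth_range(depth: np.ndarray, silhouette: np.ndarray, fraction: float = 0.20):
    """Relevant depth range around a single center-of-mass of the depth values
    within the silhouette, here computed as their mean; returns (lower, upper)."""
    values = depth[(silhouette > 0) & (depth > 0)].astype(np.float32)
    center = float(values.mean())                           # center-of-mass of the depth values
    half_range = fraction * float(values.max() - values.min())
    # For absolute depth values, a fixed half-range such as 0.3 m could be used instead.
    return center - half_range, center + half_range
```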
The apparatus 300 for processing 3D media streams is further operative to generate a smoothed depth frame 202 by selectively updating 215 depth values in the captured depth frame 201 based on the relevant depth range and the relevant region.
The apparatus 300 may be operative to selectively update 215 depth values in the captured depth frame 201 based on the relevant depth range and the relevant region by one or more of the following operations (an illustrative sketch follows the list below):
- Removing depth values for pixels or points of the captured depth frame 201 which are located within the relevant region, but outside the
smoothed silhouette, and with depth values within the relevant depth range. In practice, removing depth values is achieved by setting the value in the depth frame to a value which indicates "no depth value". This may, e.g., be zero or any other reserved value, such as NaN or a maximum depth value.
- Adding missing depth values for pixels or points located within the relevant region, by setting the missing depth values as weighted averages of depth values of neighboring pixels.
- Removing depth values which are outside the relevant depth range. Alternatively, these can be left unchanged.
- Depth values for pixels or points which are located outside the relevant region, and within the relevant depth range, are left unchanged.
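A sketch of the selective update listed above, assuming Python with OpenCV and NumPy; treating missing values as zero, and restricting the filling of missing values to the smoothed silhouette, are assumptions of the sketch.

```python
import cv2
import numpy as np

def selectively_update(depth: np.ndarray, relevant_region: np.ndarray,
                       smoothed_silhouette: np.ndarray, depth_range) -> np.ndarray:
    """Selective update of depth values; 0 encodes 'no depth value'."""
    lo, hi = depth_range
    out = depth.astype(np.float32)
    in_region = relevant_region > 0
    in_silhouette = smoothed_silhouette > 0
    in_range = (out >= lo) & (out <= hi)

    # Remove depth values inside the relevant region but outside the smoothed
    # silhouette, when their depth lies within the relevant depth range.
    out[in_region & ~in_silhouette & in_range] = 0

    # Add missing depth values inside the relevant region (and, as assumed here,
    # inside the smoothed silhouette) as weighted averages of neighbouring depths,
    # using a Gaussian weighting that ignores missing values.
    weights = (out > 0).astype(np.float32)
    summed = cv2.GaussianBlur(out * weights, (7, 7), 0)
    norm = cv2.GaussianBlur(weights, (7, 7), 0)
    fill = np.divide(summed, norm, out=np.zeros_like(summed), where=norm > 0)
    missing = (out == 0) & in_region & in_silhouette
    out[missing] = fill[missing]

    # Optionally remove depth values outside the relevant depth range.
    out[(out > 0) & ~((out >= lo) & (out <= hi))] = 0

    # Depth values outside the relevant region (and within range) stay unchanged.
    return out.astype(depth.dtype)
```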
Embodiments of the invention are advantageous in that the obtained smoothed depth frames lead to more visually pleasing 3D representations of captured persons with smooth, natural boundaries in the relevant regions of the depth frames, i.e., the regions where parts of captured persons are located. By applying silhouette-based depth-frame smoothing only to pixels or points with depth values in the relevant depth range, and which are located in the relevant region of a depth frame, the smoothing primarily affects the head and upper body of the captured person, and not any other elements such as tables, laptops, etc.
More specifically, embodiments of the invention result in 3D meshes which have fewer errors and smoother boundaries, leading to more realistic 3D representations. At the same time, embodiments of the invention are computationally not very demanding and can be applied in real time. Further advantageously, the depth-frame processing described herein, in particular the depth-frame smoothing, does not rely on other input data like RGB frames which need to be aligned with the captured depth frame, thereby eliminating a common source of errors in known solutions. The described solution is applicable to depth frames captured by a wide range of different RGB-D cameras.
The apparatus 300 for processing 3D media streams may further be operative to receive an RGB frame captured by an RGB sensor 111, create an uncolored 3D mesh by projecting 120 the captured depth frame 201 into 3D space, and color 130 the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh. Optionally, the apparatus 300 may further be operative to render the colored 3D mesh using a display 140.
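Purely as an illustration, projecting a depth frame into 3D space and coloring the resulting points could be sketched as follows, assuming a pinhole camera model with known depth-sensor intrinsics (fx, fy, cx, cy) and an RGB frame that is already aligned with the depth frame; constructing an actual mesh from the points is omitted here.

```python
import numpy as np

def depth_to_colored_points(depth_m: np.ndarray, rgb: np.ndarray,
                            fx: float, fy: float, cx: float, cy: float):
    """Back-project a depth frame (in metres) into 3D space and color the points
    from an aligned RGB frame, using a pinhole camera model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth_m > 0
    z = depth_m[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    points = np.stack([x, y, z], axis=1)   # N x 3 point cloud
    colors = rgb[valid]                    # N x 3 colors for the points
    return points, colors
```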
In the following, embodiments of a method 600 of processing 3D media streams are described with reference to Fig. 6. The method 600 may, e.g., be performed by a computing device, such as the apparatus 300 for processing 3D media streams.
The method 600 comprises receiving 601 a depth frame captured by a depth sensor. The method 600 further comprises determining 603 a relevant region for the captured depth frame. The relevant region encompasses a silhouette of at least part of a person captured in the depth frame. The method 600 further comprises generating 604 a smoothed silhouette by processing the silhouette of the at least part of a person. The smoothed silhouette has smoother borders than the silhouette comprised in the captured depth frame. The method 600 further comprises determining 605 a relevant depth range for the captured depth frame. The relevant depth range encompasses depth values representing the captured at least part of a person. The method 600 further comprises generating 606 a smoothed depth frame based on the relevant depth range and the relevant region. The smoothed depth frame is generated by selectively updating depth values in the captured depth frame.
The selectively updating 606 depth values in the captured depth frame based on the relevant depth range and the relevant region may comprise one or more of:
- Removing depth values for pixels located within the relevant region, but outside the smoothed silhouette, and with depth values within the relevant depth range,
- Adding missing depth values for pixels located within the relevant region, by setting the missing depth values as weighted averages of depth values of neighboring pixels, and
- Removing depth values which are outside the relevant depth range.
The determining 605 a relevant depth range for the captured depth frame may comprise identifying at least one center-of-mass in the depth values in the captured depth frame within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass.
The determining 605 the relevant depth range for the captured depth frame may comprise setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame. Alternatively, the determining 605 the relevant depth range for the captured depth frame may comprise setting the relevant depth range to an absolute distance.
The determining 603 a relevant region for the captured depth frame may comprise identifying a boundary between a foreground and a background in the captured depth frame. Optionally, the identifying the boundary between a foreground and a background in the captured depth frame may comprise determining a difference between morphological dilation and erosion of the captured depth frame.
The determining 603 a relevant region for the captured depth frame may comprise excluding a lower part of the depth frame.
The generating 604 the smoothed silhouette may comprise binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame. Optionally, the iteratively dilating, smoothing, and eroding, the binarized depth frame is terminated when a fraction of the
smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame extending outside the relevant region exceeds a threshold.
Optionally, the method 600 may further comprise, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, eroding and blurring the binarized depth frame.
Optionally, the method 600 may further comprise, before binarizing the depth frame, removing 602 depth values for pixels located within the relevant region and with depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels.
Optionally, the method 600 may further comprise receiving 607 an RGB frame captured by an RGB sensor, creating an uncolored 3D mesh by projecting 608 the captured depth frame into 3D space, and coloring 609 the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh.
Optionally, the method 600 may further comprise rendering 610 the colored 3D mesh using a 3D display.
It will be appreciated that the method 600 may comprise additional, alternative, or modified, steps in accordance with what is described throughout this disclosure. An embodiment of the method 600 may be implemented as the computer program 303 comprising instructions which, when the computer program 303 is executed by one or more processor(s) 301 comprised in the computing device, such as the apparatus 300 for processing 3D media streams, cause the computing device to carry out the method 600 and become operative in accordance with embodiments of the invention described herein. The computer program 303 may be stored in a computer-readable data carrier, such as the memory 302. Alternatively, the computer program 303 may be carried by a data carrier signal, e.g., downloaded to the memory 302 via the network interface circuitry 304.
The person skilled in the art realizes that the invention by no means is limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims.
Claims
1. An apparatus (300) for processing 3D media streams, operative to: receive a depth frame (201) captured by a depth sensor (112), determine (212) a relevant region for the captured depth frame, the relevant region encompassing a silhouette of at least part of a person captured in the depth frame, generate (213) a smoothed silhouette by processing the silhouette of the at least part of a person, the smoothed silhouette having smoother borders than the silhouette comprised in the captured depth frame, determine (214) a relevant depth range for the captured depth frame, the relevant depth range encompassing depth values representing the captured at least part of a person, and generate a smoothed depth frame (202) by selectively updating (215) depth values in the captured depth frame based on the relevant depth range and the relevant region.
2. The apparatus (300) according to claim 1, operative to selectively update (215) depth values in the captured depth frame based on the relevant depth range and the relevant region by one or more of: removing depth values for pixels located within the relevant region, but outside the smoothed silhouette, and with depth values within the relevant depth range, adding missing depth values for pixels located within the relevant region, by setting the missing depth values as weighted averages of depth values of neighboring pixels, and removing depth values which are outside the relevant depth range.
3. The apparatus (300) according to claim 1 or 2, operative to determine (214) the relevant depth range for the captured depth frame by:
identifying at least one center-of-mass in the depth values in the captured depth frame within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass.
4. The apparatus (300) according to any one of claims 1 to 3, operative to determine (214) the relevant depth range for the captured depth frame by setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame.
5. The apparatus (300) according to any one of claims 1 to 3, operative to determine (214) the relevant depth range for the captured depth frame by setting the relevant depth range to an absolute distance.
6. The apparatus (300) according to any one of claims 1 to 5, operative to determine (212) the relevant region for the captured depth frame by identifying a boundary between a foreground and a background in the captured depth frame.
7. The apparatus (300) according to claim 6, operative to identify the boundary between a foreground and a background in the captured depth frame by determining a difference between morphological dilation and erosion of the captured depth frame.
8. The apparatus (300) according to any one of claims 1 to 7, operative to determine (212) the relevant region for the captured depth frame by excluding a lower part of the depth frame.
9. The apparatus (300) according to any one of claims 1 to 8, operative to generate (213) the smoothed silhouette by:
binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame.
10. The apparatus (300) according to claim 9, operative to terminate the iteratively dilating, smoothing, and eroding, the binarized depth frame when a fraction of the smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame extending outside the relevant region exceeds a threshold.
11. The apparatus (300) according to claim 9 or 10, further operative, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, to erode and blur the binarized depth frame.
12. The apparatus (300) according to any one of claims 9 to 11, further operative, before binarizing the depth frame, to remove (211) depth values for pixels located within the relevant region and with depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels.
13. The apparatus (300) according to any one of claims 1 to 12, further operative to: receive an RGB frame captured by an RGB sensor (111), create an uncolored 3D mesh by projecting (120) the captured depth frame into 3D space, and color (130) the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh.
14. The apparatus (300) according to claim 13, further operative to: render the colored 3D mesh using a 3D display (140).
15. A method (600) of processing 3D media streams, comprising: receiving (601 ) a depth frame captured by a depth sensor, determining (603) a relevant region for the captured depth frame, the relevant region encompassing a silhouette of at least part of a person captured in the depth frame, generating (604) a smoothed silhouette by processing the silhouette of the at least part of a person, the smoothed silhouette having smoother borders than the silhouette comprised in the captured depth frame, determining (605) a relevant depth range for the captured depth frame, the relevant depth range encompassing depth values representing the captured at least part of a person, and generating (606) a smoothed depth frame by selectively updating depth values in the captured depth frame based on the relevant depth range and the relevant region.
16. The method (600) according to claim 15, wherein the selectively updating (606) depth values in the captured depth frame based on the relevant depth range and the relevant region comprises one or more of: removing depth values for pixels located within the relevant region, but outside the smoothed silhouette, and with depth values within the relevant depth range, adding missing depth values for pixels located within the relevant region, by setting the missing depth values as weighted averages of depth values of neighboring pixels, and removing depth values which are outside the relevant depth range.
17. The method (600) according to claim 15 or 16, wherein the determining (605) a relevant depth range for the captured depth frame comprises:
identifying at least one center-of-mass in the depth values in the captured depth frame within the silhouette of the at least part of a person, and setting the relevant depth range to encompass the at least one center-of-mass.
18. The method (600) according to any one of claims 15 to 17, wherein the determining (605) the relevant depth range for the captured depth frame comprises setting the relevant depth range to a fraction of the actual range of depth values in the captured depth frame.
19. The method (600) according to any one of claims 15 to 17, wherein the determining (605) the relevant depth range for the captured depth frame comprises setting the relevant depth range to an absolute distance.
20. The method (600) according to any one of claims 15 to 19, wherein the determining (603) a relevant region for the captured depth frame comprises identifying a boundary between a foreground and a background in the captured depth frame.
21. The method (600) according to claim 20, wherein the identifying the boundary between a foreground and a background in the captured depth frame comprises determining a difference between morphological dilation and erosion of the captured depth frame.
22. The method (600) according to any one of claims 15 to 21, wherein the determining (603) a relevant region for the captured depth frame comprises excluding a lower part of the depth frame.
23. The method (600) according to any one of claims 15 to 22, wherein the generating (604) the smoothed silhouette comprises:
binarizing the depth frame, and iteratively dilating, smoothing, and eroding, the binarized depth frame.
24. The method (600) according to claim 23, wherein the iteratively dilating, smoothing, and eroding, the binarized depth frame is terminated when a fraction of the smoothed silhouette in the dilated, smoothed, and eroded, binarized depth frame extending outside the relevant region exceeds a threshold.
25. The method (600) according to claim 23 or 24, further comprising, after the iteratively dilating, smoothing, and eroding, the binarized depth frame, eroding and blurring the binarized depth frame.
26. The method (600) according to any one of claims 23 to 25, further comprising, before binarizing the depth frame, removing (602) depth values for pixels located within the relevant region and with depth values which differ by more than a gradient threshold value from the depth value of at least one of their neighboring pixels.
27. The method (600) according to any one of claims 15 to 26, further comprising: receiving (607) an RGB frame captured by an RGB sensor, creating an uncolored 3D mesh by projecting (608) the captured depth frame into 3D space, and coloring (609) the uncolored 3D mesh using the captured RGB frame to obtain a colored 3D mesh.
28. The method (600) according to claim 27, further comprising: rendering (610) the colored 3D mesh using a 3D display.
29. A computer program (303) comprising instructions which, when the computer program (303) is executed by one or more processors (301) comprised in a computing device (300), cause the computing device (300) to carry out the method (600) according to any one of claims 15 to 28.
30. A computer-readable data carrier (302) having stored thereon the computer program (303) according to claim 29.
31. A data carrier signal carrying the computer program (303) according to claim 29.