HK1261742A1 - Depth map delivery formats for stereoscopic and auto-stereoscopic displays - Google Patents
Depth map delivery formats for stereoscopic and auto-stereoscopic displays
- Publication number
- HK1261742A1 (application HK19121636.5A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- picture
- data
- depth map
- depth
- view
- Prior art date
Description
This application claims priority to United States Provisional Patent Application No. 61/659,588 filed on 14 June 2012; United States Provisional Patent Application No. 61/712,131 filed on 10 October 2012; United States Provisional Patent Application No. 61/739,886 filed on 20 December 2012; United States Provisional Patent Application No. 61/767,416 filed on 21 February 2013; United States Provisional Patent Application No. 61/807,013 filed on 1 April 2013; United States Provisional Patent Application No. 61/807,668 filed on 2 April 2013; and United States Provisional Patent Application No. 61/822,060 filed on 10 May 2013.
This application is a European divisional application of Euro-PCT patent application EP 13732024.8 (reference: D12128EP01), filed 12 June 2013.
The invention is defined by the subject-matter according to the independent claims. Further aspects of the invention are defined according to the dependent claims. References to embodiments which do not fall under the scope of the claims are to be understood as examples useful for understanding the invention.
3D video systems garner great interest for enhancing a consumer's experience, whether at the cinema or in the home. These systems use stereoscopic or auto-stereoscopic methods of presentation, including:
- (i) anaglyph - provides left/right eye separation by filtering the light through a two color filter, commonly red for one eye, and cyan for the other eye;
- (ii) linear polarization - provides separation at the projector by filtering the left eye image through a linear polarizer (commonly) oriented vertically, and filtering the right eye image through a linear polarizer oriented horizontally;
- (iii) circular polarization - provides separation at the projector by filtering the left eye image through a (commonly) left handed circular polarizer, and filtering the right eye image through a right handed circular polarizer;
- (iv) shutter glasses - provides separation by multiplexing the left and right images in time, and
- (v) spectral separation - provides separation at the projector by filtering the left and right eye images spectrally, where each eye receives a complementary portion of the red, green, and blue spectra.
Most of the 3D displays available in the market today are stereoscopic TVs, requiring the user to wear special 3D glasses in order to experience the 3D effect. Delivery of 3D content to these displays only requires carrying two separate views: a left view and a right view. Auto-stereoscopic (glasses-free) displays are on the horizon. These displays provide some amount of motion parallax; as viewers move their head around, they see objects as if from different angles.
Traditional stereoscopic displays provide a single 3D view; however, auto-stereoscopic displays are required to provide multiple views, such as five views, nine views, 28 views, etc., based on the design of the display. When regular stereoscopic content is provided to auto-stereoscopic displays, the displays extract a depth map and create or render multiple views based on this depth map. As used herein, the term "depth map" denotes an image or other bit-stream that contains information related to the distance of the surfaces of scene objects from a viewpoint. A depth map can be readily converted to a disparity map, and in the context of this document the terms depth map and disparity map are used interchangeably.
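The conversion itself is not spelled out in this document; the short sketch below only illustrates the standard pinhole-camera relation between depth and disparity (disparity = focal length x baseline / depth), with purely illustrative camera parameters:

```python
import numpy as np

def depth_to_disparity(depth_m, focal_px=1000.0, baseline_m=0.065):
    """Convert a depth map (meters) to a disparity map (pixels) using the
    standard stereo relation disparity = f * B / Z.
    focal_px and baseline_m are illustrative values only."""
    depth_m = np.asarray(depth_m, dtype=np.float64)
    return focal_px * baseline_m / np.maximum(depth_m, 1e-6)

# Example: objects 1 m to 10 m away; disparity falls off as 1/Z.
depth = np.linspace(1.0, 10.0, 5)
print(depth_to_disparity(depth))
```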
The depth map also may be used for retargeting the 3D experience for different display types with different resolutions (e.g., 1080p displays or 2K displays). A number of studies have shown that the amount of depth designed for 3D cinema is not suitable for smaller mobile devices and vice-versa. There is also viewer preference regarding the amount of 3D depth, which can be age-dependent (the young prefer a larger depth experience than the old), culture-dependent (Asian cultures prefer higher depth than Western cultures), or simply viewer-dependent. The depth map information could be used to re-render the stereo views to increase or decrease the perceived depth and to make other adjustments. As appreciated by the inventors here, improved techniques for delivering depth map information along with the content are desirable for improving the user experience with auto-stereoscopic and stereoscopic displays. It is further appreciated that these improved techniques preferably are backwards compatible with existing single-view and 3D systems.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.
Document WO 2012/012584 A1 discloses a multi-layered frame-compatible video delivery that achieves full-resolution 3D delivery by means of a scalable video coder for multi-view images, with a stereoscopic frame-compatible signal as the base layer and additional enhancement layers.
Document "Joint texture and depth map video coding based on the scalable extension of H.264/AVC" from Siping Tao et A1 discloses a joint 2D and depth data coding based on the scalable extension of H.264/AVC, wherein the 2D video is coded as base layer and depth data as enhancement layer via inter-ayer prediction tools.
An embodiment of the present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
- FIG. 1A and FIG. 1B depict example Frame-Compatible-Full-Resolution (FCFR) encoders and decoders for 3D video;
- FIG. 1C depicts a simplified representation of a 3D FCFR format with no depth data; FIG. 1D depicts a simplified representation of the corresponding decoder;
- FIG. 2A depicts an example 3-layer depth map delivery format according to an embodiment of the present invention, where the base layer comprises a side-by-side 3D signal; FIG. 2B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 2C depicts an example 3-layer depth map delivery format according to an embodiment of the present invention, where the base layer comprises a top-and-bottom 3D signal;
- FIG. 3A depicts an example 3-layer depth map delivery format according to an embodiment of the present invention; FIG. 3B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 4A depicts an example 3-layer depth map delivery format according to an embodiment of the present invention; FIG. 4B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 5 depicts an example single-layer depth map delivery format according to an embodiment of the present invention;
- FIG. 6 depicts an example dual-layer depth map delivery format according to an embodiment of the present invention;
- FIG. 7A depicts an example 2-layer depth map delivery format according to an embodiment of the present invention; FIG. 7B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 8A depicts an example 3-layer depth map delivery format according to an embodiment of the present invention; FIG. 8B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 9A depicts an example 3-layer depth map delivery format according to an embodiment of the present invention; FIG. 9B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 10A depicts an example 2-layer depth map delivery format according to an embodiment of the present invention; FIG. 10B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 11A depicts an example 2-layer depth map delivery format according to an embodiment of the present invention; FIG. 11B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 12A and FIG. 12B depict examples of single layer depth map delivery formats according to embodiments of the present invention;
- FIG. 13A depicts an example 2-layer depth map delivery format according to an embodiment of the present invention; FIG. 13B depicts examples of corresponding bitstreams that can be extracted by suitable decoders;
- FIG. 14 depicts an example single layer depth map delivery format according to an embodiment of the present invention;
- FIG. 15A and FIG. 15B depict example single layer depth map delivery formats according to embodiments of the present invention;
- FIG. 15C depicts an example of segmented depth map multiplexing according to an embodiment of the present invention;
- FIG. 16A through FIG. 16E depict example 3-layer depth map delivery formats according to embodiments of the present invention; and
- FIG. 17A and FIG. 17B depict example 2-layer depth map delivery formats according to embodiments of the present invention.
Delivery formats for depth maps for stereoscopic and auto-stereoscopic displays are described herein. The formats support a variety of video delivery scenarios, including traditional cable, satellite, or over the air broadcasting and over-the-top delivery. In some embodiments, the formats allow legacy decoders to extract a backwards-compatible 2D or 3D stream while newer decoders can render multiple views and associated depth map data for either stereoscopic or auto-stereoscopic displays. In the following description, for the purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily obscuring the present invention.
Example embodiments described herein relate to delivery formats for depth map information for stereoscopic and auto-stereoscopic displays. Given a 3D input picture and corresponding input depth map data, a side-by-side and a top-and-bottom picture are generated based on the input picture. Using an encoder, the side-by-side picture is coded to generate a coded base layer. Using the encoder and a texture reference processing unit (RPU), the top-and-bottom picture is encoded to generate a first enhancement layer, wherein the first enhancement layer is coded partially based on the base layer stream. Using the encoder and a depth-map RPU (denoted as Z-RPU or RPUz in the following), depth data for the side-by-side picture are encoded to generate a second enhancement layer, wherein the second enhancement layer is partially coded based on the base layer.
In some embodiments, instead of coding depth map data directly into the base and enhancement layers, the encoder may encode residual depth map data, the residual depth map data comprising differences between the input depth map data and estimated depth map data generated by a Z-RPU.
In some embodiments, depth map data and video data are encoded into a single layer, the single layer comprising half-resolution data of a first view and either half-resolution data of the second view or depth map data for the half-resolution data of the first view.
In some embodiments, depth map data and video data are encoded into two base layers. A first base layer comprises full resolution data of a first view, while a second base layer comprises either full resolution data of a second view or full-resolution depth data of the first view.
In some embodiments, depth map data and video data are encoded in three layers. The base layer comprises half-resolution data of a first view and its corresponding depth map data. A first enhancement layer comprises a top-and-bottom picture, and a second enhancement layer comprises half-resolution data of a second view and its corresponding depth map data.
In some embodiments, depth map data and video data are encoded into two layers. The base layer comprises both the luminance and the chroma components of a side-by-side picture. The enhancement layer's luma component comprises the luma components of a top-and-bottom picture, and the enhancement layer's chroma component comprises depth map data for the top-and-bottom picture.
In some embodiments, the side-by-side and top-and-bottom pictures are padded so their horizontal and vertical spatial dimensions are integer multiples of a predefined macroblock size (e.g., 16). Then, the padded data comprise sub-sampled versions of the original depth map data.
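A minimal sketch of the padding computation is shown below; a macroblock size of 16 is assumed, as in the example above, and the frame sizes are illustrative:

```python
def pad_to_macroblock(height, width, mb_size=16):
    """Return the padded dimensions and the number of padding rows/columns
    needed so that both dimensions are integer multiples of mb_size."""
    pad_rows = (-height) % mb_size
    pad_cols = (-width) % mb_size
    return height + pad_rows, width + pad_cols, pad_rows, pad_cols

# Example: a 1920x1080 picture gains 8 padding rows (1080 -> 1088),
# which can then carry sub-sampled depth map data.
print(pad_to_macroblock(1080, 1920))  # (1088, 1920, 8, 0)
```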
In some embodiments, chroma pixel values in an enhancement layer may also be replaced with depth map data or residual depth map data.
As depicted in FIG. 1A , full resolution (e.g., 1920x1080), left and right views (105-1, 105-2) of an input 3D signal (105) are filtered, sub-sampled (horizontally or vertically), and multiplexed to generate a side-by-side view 112 and top-and-bottom view 117. The side-by-side and top-and-bottom pictures comprise both views of the input; but each view is at a lower resolution. For example, for a 1920x1080 input, the side-by-side sub-pictures (L, R) may be 960x1080 each, and the top-and-bottom sub-pictures (L', R') may be 1920x540 each. The side-by-side signal 112 is encoded by BL encoder 120 to generate a coded base layer (BL) bit-stream 122. BL encoder 120 may be any of the known video encoders, such as those specified by the ISO/IEC MPEG-2, MPEG-4 part 2, or H.264 (AVC) standards, or other encoders, such as Google's VP8, Microsoft's VC-1, HEVC, and the like.
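A minimal sketch of this multiplexing step is shown below; plain decimation stands in for the anti-alias filtering and sub-sampling an actual encoder would apply:

```python
import numpy as np

def make_side_by_side(left, right):
    """Horizontally decimate each full-resolution view (e.g., 1920x1080)
    to half width (960x1080) and place the halves side by side."""
    return np.concatenate([left[:, 0::2], right[:, 0::2]], axis=1)

def make_top_and_bottom(left, right):
    """Vertically decimate each view to half height (1920x540)
    and stack the halves top and bottom."""
    return np.concatenate([left[0::2, :], right[0::2, :]], axis=0)

left = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
right = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
print(make_side_by_side(left, right).shape)    # (1080, 1920)
print(make_top_and_bottom(left, right).shape)  # (1080, 1920)
```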
Top-and-bottom signal 117 may be encoded by a second encoder, enhancement layer (EL) encoder 130, to generate coded enhancement layer (EL) stream 132. EL encoder may encode in the same format as BL encoder 120 (e.g., H.264), or in a separate format. In some embodiments, EL encoder 130 may encode signal 117 by using reference frames from both the top-and-bottom signal 117 and the side-by-side signal 112. For example, BL encoder 120, EL Encoder 130, and associated storage (not shown), may comprise a multi-view codec as specified by the ISO/IEC H.264 specification for a multi-view codec (MVC).
In some embodiments, the encoder of FIG. 1A may also include a Reference Processor Unit (RPU) 125. As used herein in relation to the RPU, the term "Reference" is not meant to imply or express, and should not be interpreted as meaning, that this picture is explicitly used as a reference within the complete coding process (e.g., in the sense of a "reference picture"). The RPU may conform to a description set forth in the following two patent application publications, filed pursuant to the Patent Cooperation Treaty (PCT): (1) WO 2010/123909 A1 by Tourapis, A., et al. for "Directed Interpolation/Post-processing Methods for Video Encoded Data"; and (2) WO 2011/005624 A1 by Tourapis, A., et al. for "Encoding and Decoding Architecture for Frame Compatible 3D Video Delivery." The following descriptions of the RPU apply, unless otherwise specified to the contrary, both to the RPU of an encoder and to the RPU of a decoder. Artisans of ordinary skill in fields that relate to video coding will understand the differences, and will be capable of distinguishing between encoder-specific, decoder-specific and generic RPU descriptions, functions and processes upon reading of the present disclosure. Within the context of a 3D video coding system as depicted in FIG. 1A , the RPU (125) accesses and interpolates decoded images from BL Encoder 120, according to a set of rules of selecting different RPU filters and processes.
The RPU 125 enables the interpolation process to be adaptive at a region level, where each region of the picture/sequence is interpolated according to the characteristics of that region. RPU 125 can use horizontal, vertical, or two dimensional (2D) filters, edge adaptive or frequency based region-dependent filters, and/or pixel replication filters or other methods or means for interpolation and image processing.
For example, one pixel replication filter may simply perform a zero-order-hold, e.g., each sample in the interpolated image will be equal to the value of a neighboring sample in a low resolution image. Another pixel replication filter may perform a cross-view copy operation, e.g., each interpolated sample in one view, will be equal to the non-interpolated co-located sample from the opposing view.
Additionally or alternatively, a disparity-compensated copy scheme can also be used in the RPU. For example, the filter may copy a non-collocated region of samples where the location of the region to be copied, which may also be a region from a different view, can be specified using a disparity vector. The disparity vector may be specified using integer or sub-pixel accuracy and may involve simple, e.g. translational motion parameter, or more complex motion models such as affine or perspective motion information and/or others.
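As a rough illustration (not the RPU's actual implementation), the sketch below shows a zero-order-hold replication filter and a purely translational, integer-pel disparity-compensated copy; the region coordinates and disparity vector are arbitrary examples:

```python
import numpy as np

def zero_order_hold_upsample(half_res):
    """Pixel replication: each interpolated sample repeats its neighbor,
    here doubling the horizontal resolution of one side-by-side view."""
    return np.repeat(half_res, 2, axis=1)

def disparity_compensated_copy(other_view, top, left, height, width, dv):
    """Copy a non-collocated region from the opposing view, displaced by an
    integer-pel translational disparity vector dv = (dy, dx)."""
    dy, dx = dv
    return other_view[top + dy: top + dy + height,
                      left + dx: left + dx + width].copy()

view = np.random.randint(0, 256, (1080, 960), dtype=np.uint8)
print(zero_order_hold_upsample(view).shape)                               # (1080, 1920)
print(disparity_compensated_copy(view, 100, 200, 16, 16, (0, -4)).shape)  # (16, 16)
```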
An encoder may select RPU filters and output regional processing signals, which are provided as input data to a decoder RPU (e.g., 140). The signaling (e.g., RPUL 127) specifies the filtering method on a per-region basis. For example, parameters that relate to region attributes such as the number, size, shape and other characteristics may be specified in an RPUL related data header. Some of the filters may comprise fixed filter coefficients, in which case the filter coefficients need not be explicitly signaled by the RPU. Other filter modes may comprise explicit modes, in which the filter parameters, such as coefficient values and number of horizontal/vertical taps, are signaled explicitly.
The filters may also be specified per each color component. The RPU may specify linear filters. Non-linear filters such as edge-adaptive filters, bi-lateral filters, etc., may also be specified in the RPU. Moreover, prediction models that specify advanced motion compensation methods such as the affine or perspective motion models may also be signaled.
The RPU data signaling 127 can either be embedded in the encoded bitstream, or transmitted separately to the decoder. The RPU data may be signaled along with the layer on which the RPU processing is performed. Additionally or alternatively, the RPU data of all layers may be signaled within one RPU data packet, which is embedded in the bitstream either prior to or subsequent to embedding the layer 2 encoded data. The provision of RPU data may be optional for a given layer. In the event that RPU data is not available, a default scheme may thus be used for up-conversion of that layer. Not dissimilarly, the provision of an enhancement layer encoded bitstream is also optional.
An embodiment allows for multiple possible methods of optimally selecting the filters and filtered regions in each RPU. A number of criteria may be used separately or in conjunction in determining the optimal RPU selection. The optimal RPU selection criteria may include the decoded quality of the base layer bitstream, the decoded quality of the enhancement layer bitstreams, the bit rate required for the encoding of each layer including the RPU data, and/or the complexity of decoding and RPU processing of the data.
An RPU may be optimized independently of subsequent processing in the enhancement layer. Thus, the optimal filter selection for an RPU may be determined such that the prediction error between the interpolated base layer images and the original left and right eye images is minimized, subject to other constraints such as bitrate and filter complexity.
The RPU 125 may serve as a pre-processing stage that processes information from BL encoder 120, before utilizing this information as a potential predictor for the enhancement layer in EL encoder 130. Information related to the RPU processing may be communicated (e.g., as metadata) to a decoder as depicted in FIG. 1B using an RPU Layer (RPUL) stream 127. RPU processing may comprise a variety of image processing operations, such as: color space transformations, non-linear quantization, luma and chroma up-sampling, and filtering. In a typical implementation, the EL 132, BL 122, and RPUL 127 signals are multiplexed into a single coded bitstream (not shown).
BL decoder 135 (e.g., an MPEG-2 or H.264 decoder) corresponds to the BL encoder 120. EL decoder 145 (e.g., an MPEG-2 or H.264 decoder) corresponds to the EL Encoder 130. Decoder RPU 140 corresponds to the encoder RPU 125, and with guidance from RPUL input 127, may assist in the decoding of the EL layer 132 by performing operations corresponding to operations performed by the encoder RPU 125.
Given the coded bitstream generated by the encoder representation depicted in FIG. 1C, FIG. 1D depicts a simplified representation for the corresponding receiver embodiments. FIG. 1D can also be viewed as a simplified version of FIG. 1B . As explained before, a legacy decoder with a single BL decoder 135 can extract from this stream a legacy (e.g., half-resolution) frame compatible (FC) 3D stream, while a newer decoder (e.g., an H.264 MVC decoder, or a decoder with an EL decoder 145 and an RPU 140) may also extract the enhancement layer and thus reconstruct a higher-resolution and quality FCFR 3D stream. For notation purposes, a connection (e.g., 137) between two decoders, such as between BL decoder 135 and EL decoder 145, denotes that the EL decoder may utilize as reference frames, frames extracted and post-processed from the base layer, for example through a decoder RPU 140 (not shown). In other words, the coded EL stream is partially decoded based on data from the BL stream.
EL-2 layer 219S may be encoded on its own using a second EL encoder, or as depicted in FIG. 2A , it can be encoded using RPUz 230 by referencing depth data extracted from the BL stream 212.
Depth-map RPU 230 (also referred to as RPUz or Z-RPU, because it operates on depth or Z-buffer data) is very similar in operation and functionality to texture RPU 225 (or RPU 125) (also referred to as RPUT because it operates on texture data), except it has the added functionality to extract (or predict) estimated depth-map data from a baseline input (e.g., BL 212). Depth map information can be extracted from 2D or 3D data using any of the known techniques in the art, such as "High-Accuracy Stereo Depth Maps Using Structured Light," by Daniel Scharstein and Richard Szeliski, published in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 195-202, June 2003.
In some embodiments, the EL-2 layer 219S may carry the following data: original depth map without any modification (e.g., depth map as captured by a camera), or the difference between the original depth map and a depth map predicted by RPUz, or specific regions from an original depth map. The same format may also be used to carry various parameters needed for defining the RPUz processing, either as part of the depth data or as part of a separate RPUz bit stream, similar to the RPUT bit stream (e.g., 127).
Given the depth map coding format of FIG. 2A , depending on the capabilities of a receiver, FIG. 2B depicts a number of alternative decoded bit streams. For example, a receiver with a single decoder, BL decoder 250, can extract only a frame compatible (FC) 3D stream. A receiver with both BL decoder 250 and an EL decoder-1 255 (e.g., an MVC decoder) can also decode an FCFR 3D stream. A receiver with a second EL-decoder (265) and a decoder RPUz (not shown) may also decode the depth maps ZL and ZR. A receiver with BL decoder 250 and only EL decoder 2 (265), may decode an FC 3D stream and depth maps ZL and ZR.
As depicted in FIG. 2A , the base layer 212 comprises side-by-side multiplexed L/R coded data (e.g., 112) and the EL-1 layer comprises top-and-bottom L'/R' multiplexed data (e.g., 117); however, in all of the delivery formats for depth maps discussed herein, using side-by-side 3D data in the base layer is inter-changeable with using top-and-bottom 3D data. Hence, as depicted in FIG. 2C , in an alternative embodiment, BL may comprise the top-and-bottom L'/R' signal 217 (e.g., 117), EL-1 may comprise the side-by-side L/R signal 212 (e.g., 112), and EL-2 may comprise top-and-bottom depth map data ZL'/ZR' (219T). Similar embodiments may be derived for other example embodiments described in this specification.
In an embodiment, RPUz 330 may utilize information from base layer 312 to derive predicted depth data ZEL and ZER. Then, the encoder for BL-2, instead of coding ZL and ZR directly, may encode the depth residuals RZL = ZL - ZEL and RZR = ZR - ZER. Similar depth map residual coding is applicable to all example embodiments described in this specification.
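A minimal sketch of this residual coding path is shown below; the RPUz prediction is stood in for by an arbitrary estimate, and the 8-bit clipping at the decoder is an assumption rather than something specified here:

```python
import numpy as np

def encode_depth_residual(z, z_estimate):
    """Encoder side: form the residual between the input depth map and the
    depth map predicted from the base layer (e.g., RZL = ZL - ZEL)."""
    return z.astype(np.int16) - z_estimate.astype(np.int16)

def decode_depth(residual, z_estimate):
    """Decoder side: add the decoded residual back to the estimate
    (the role of the adder described below) to recover the depth map."""
    return np.clip(residual + z_estimate.astype(np.int16), 0, 255).astype(np.uint8)

zl = np.random.randint(0, 256, (540, 960), dtype=np.uint8)        # input depth map
zel = np.clip(zl.astype(np.int16) + 3, 0, 255).astype(np.uint8)   # stand-in RPUz estimate
rzl = encode_depth_residual(zl, zel)
assert np.array_equal(decode_depth(rzl, zel), zl)
```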
Given depth map data encoded according to FIG. 3A , depending on the capabilities of the receiver, FIG. 3B depicts alternative decoding scenarios. For example, a receiver with a single BL decoder, BL Decoder-1 350, may decode an FC 3D stream. A receiver with a second BL decoder (BL-Decoder-2 360) may decode either depth data ZL and ZR or residual depth data (RZL, RZR). A receiver with the second BL-decoder 360 and a decoder RPUz may use the BL stream to reconstruct estimated depth data (ZEL and ZER), which can be added (e.g., via adder 365) to the decoded residual depth data (RZL, RZR) to generate output depth data ZL and ZR. Note that the addition function 365 may be implemented by the decoder's RPUz or by separate processing circuitry. Finally, a receiver with BL-decoder-1 350 and EL-decoder 355 may use the bit stream EL-1 and reference data from the BL bit stream to reconstruct an FCFR 3D stream.
Given depth map data encoded according to FIG. 4A, FIG. 4B depicts alternative decoding scenarios. Receivers with a single BL decoder 450 may decode an FC 3D stream. Receivers with an additional EL decoder (455 or 460) and RPUT and RPUz (or similar) functionality can also decode either a full-resolution (FR) left view stream, a half-resolution (HR) right-view stream, and left view depth data (ZL), or they can decode an FR right view, an HR left view, and right view depth data (ZR). Receivers with two additional EL decoders (455 and 460) can also decode an FCFR 3D stream and the depth data from both views.
In some embodiments, RPUz 730 may be skipped altogether, and the EL layer 717 may be encoded on its own, as a second base layer, with no reference to the base layer.
In some embodiments, RPUz 730 may utilize information from base layer 712 to extract estimated depth data ZEL and ZER. Then, instead of comprising the original ZL and ZR depth data, enhancement layer 717 may comprise depth-map residual values, such as RZL = ZL - ZEL and RZR = ZR - ZER.
Given the encoder format depicted in FIG. 7A, FIG. 7B depicts alternative decoding embodiments. Receivers with a single BL decoder 735 may decode an FC 3D stream. Receivers with an additional EL decoder (745) may also decode the corresponding ZL and ZR depth map data.
In another embodiment, instead of using the side-by-side L/R data (e.g., 112) as BL layer 712, one may use the top-and-bottom L'/R' data (e.g., 117). In such an embodiment, the EL stream 717 will carry the corresponding top-and-bottom depth map data as well.
Most of the depth-map data delivery formats described so far allow legacy receivers to decode at least a backwards-compatible, half-resolution (FC) 3D stream. When backward compatibility with a single decoder is not a requirement, then alternative embodiments may be derived.
The same delivery format may also be used in alternative embodiments where in BL 512, the half-resolution left view (L) may be replaced by a half-resolution right view (R), or the top (L') of the top-and-bottom L'/R' signal (147), or the bottom (R') of the top-and-bottom L'/R' signal (147), and the left-view depth map is replaced by the corresponding depth-map.
Decoding this format requires at least two BL decoders; one for decoding the left-view data (L) and one for decoding either left-view depth map data or right-view data. Auxiliary data (or metadata) that contain information about the picture arrangements on a per picture basis may also be transmitted. This format allows a receiver with one decoder to reconstruct a 2D video and a receiver with two decoders to reconstruct an FCFR 3D or an FC 3D video.
In some embodiments, BL-1 (612) may carry the right view data (R) and BL-2 (617) may carry either right-view depth data (ZR) or left-view data (L).
Given the delivery format depicted in FIG. 8A, FIG. 8B depicts alternative decoding scenarios using legacy and compatible decoders. A receiver with a single BL decoder 850 may extract a 2D stream. A decoder with an MVC decoder or with an EL-decoder 855 may extract an FCFR 3D stream. A decoder with an additional EL decoder 860 (or a 3-layer MVC decoder), may also extract the left-view and right-view depth map data. A decoder with a single BL decoder 850 and EL Decoder-2 may extract a 2D stream plus corresponding depth data.
Given the delivery format depicted in FIG. 9A, FIG. 9B depicts examples of decoding scenarios in a receiver. A receiver with a single BL decoder 950 may decode a half-resolution (HR) left view and half-resolution ZL. A receiver with an additional EL decoder-1 955 can also decode the L'/R' top-and-bottom signal; thus, it can reconstruct a full-resolution left view (or FR right view) and a half-resolution right view (or an HR left view); both of these signals can be used to recreate a 3D view. A receiver with a second EL decoder (e.g., 960) can also decode a half-resolution right-view R and a half-resolution ZR, thus being able to generate an FCFR 3D signal. A receiver with a BL decoder 950 and only the second EL-Decoder 960 may decode a frame-compatible 3D signal plus depth data.
On the receiver, as depicted in FIG. 10B , a receiver with a single BL decoder 1035 may decode a half-resolution left view and its depth map. A receiver with an additional EL decoder 1045 (e.g., an MVC decoder that may or may not include a receiver RPU 140) can also decode a half-resolution right view and its depth map. By combining the two views, the receiver can render a half-resolution (or frame-rate compatible) 3D signal.
In an alternative embodiment, in FIG. 10A , in the EL stream 1017, instead of transmitting the horizontal half-resolution R signal and horizontal half-resolution ZR, one may transmit the vertical half-resolution signal R' (e.g., the bottom of top-and-bottom signal 117) and a vertical half-resolution ZR'. The decoder operation remains the same.
As depicted in FIG. 11B , a receiver with a single BL decoder 1135 may decode an FC 3D signal. A receiver with a dual layer decoder may also decode the top-and-bottom L'/R' signal and the depth map data, thus being able to reconstruct an FCFR 3D signal and depth map data for both views.
As depicted in FIG. 12A , image data (e.g. L or R) and their corresponding depth data (e.g., ZL or ZR) may be vertically aligned. In another embodiment, depicted in FIG. 12B , image data and their corresponding depth data may also be aligned horizontally.
Some embodiments may skip the RPUz 1330 and encode depth-map data 1325 on its own as another base layer.
In some embodiments, RPUz 1330 may utilize information from base layer 1305 to extract estimated depth data ZEL and ZER. Then, instead of comprising the original ZL and ZR depth data, enhancement layer 1325 may comprise depth-map residual values, such as RZL = ZL - ZEL and RZR = ZR - ZER.
Given the delivery format depicted in FIG. 13A, FIG. 13B depicts alternative receiver configurations. A receiver with a single BL decoder 1335 may decode a full-resolution 3D stream. A receiver with an additional EL decoder 1345 may also decode the corresponding depth data.
Given a multiplexed input frame (e.g., 1512) with a pixel resolution h x w (e.g., h = 1080 and w = 1920), in an embodiment, the sub-sampled left view (L) may be allocated more pixels than its associated depth map. Thus, given a scale a, where 1 > a ≥ 1/2, the original left view picture may be scaled (e.g., sub-sampled) to a size h x aw, while the depth map may be scaled to a size h x (1-a)w. This approach may result in sharper 3D pictures than a symmetric allocation (e.g., when a = 1/2).
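A minimal sketch of this asymmetric multiplexing is shown below, using a = 2/3 as an example scale; nearest-neighbor rescaling stands in for proper filtered sub-sampling:

```python
import numpy as np

def scale_nearest(img, out_h, out_w):
    """Nearest-neighbor rescale (stand-in for filtered sub-sampling)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows][:, cols]

def asymmetric_multiplex(view, depth, a=2.0 / 3.0):
    """Pack a view scaled to h x (a*w) next to its depth map scaled to
    h x ((1-a)*w), as in FIG. 15A."""
    h, w = view.shape
    wa = int(round(a * w))
    return np.concatenate([scale_nearest(view, h, wa),
                           scale_nearest(depth, h, w - wa)], axis=1)

left = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
zl = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
print(asymmetric_multiplex(left, zl).shape)  # (1080, 1920): 1280 view + 640 depth
```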
As discussed earlier, optionally, additional depth data (e.g., ZL' and ZL") may also be embedded in the corresponding chroma components of the coded frame (e.g., 1512-UV).
In an embodiment, backward compatibility may be achieved by defining the active area of the picture (e.g., h x aw) by using cropping rectangle and aspect ratio syntax parameters in the encoding bitstream, similar to those defined in AVC/H.264 or the upcoming HEVC video coding standard. Under such an implementation, a legacy 2D receiver may extract, decode, and display only the picture area (e.g., L) defined by these parameters and ignore the depth map information (e.g., ZL). Receivers with 3D capability may decode the whole picture, determine the picture areas and depth-map areas using the cropping parameters, and then use the depth map information to render multiple views. The 3D receiver can scale the 2D picture and depth as needed using the received cropping and aspect ratio parameters. Auxiliary data (or metadata) that contain information about the picture arrangements on a per picture basis may also be transmitted.
The same delivery format may also be used in alternative embodiments where in BL 1512, the sub-resolution left view (L) may be replaced by a sub-resolution right view (R), or scaled versions of the top (L') of the top-and-bottom L'/R' signal (147), or the bottom (R') of the top-and-bottom L'/R' signal (147), and the left-view depth map is replaced by the corresponding depth-map. In some embodiments (e.g., as shown in FIG. 4A and FIG. 15B ), the asymmetric spatial multiplexing may also be applied in the vertical direction. In some embodiments (not shown), the asymmetric spatial multiplexing may be applied to both the horizontal and vertical directions.
In an embodiment, FIG. 15C depicts an example of an alternative depth delivery format based on segmented depth maps. Such embodiments allow the aspect ratios of the transmitted depth maps to match more closely the aspect ratios of the transmitted image views. As an example, consider an input 1080 x 1920 image and an asymmetric multiplexing format as depicted in FIG. 15A , where, without limitation, a= 2/3. Then, in an embodiment, the luminance signal 1512-Y (or 1512C-Y) may comprise one view (e.g., the left view L) scaled at a 1080 x 1280 resolution, and the corresponding depth map (e.g., ZL) scaled at a 1080 x 640 resolution. In some embodiments, it may be more beneficial to transmit a 540 x 960 depth map, which better matches the original aspect ratio. Such a depth map may be segmented horizontally into two continuous parts (e.g., ZLA and ZLB), which, as depicted in FIG. 15C , may be multiplexed by stacking them one on top of the other. Hence, in an example embodiment, the luminance signal 1512C-YS may comprise two multiplexed parts: an image part (e.g., the left view L) scaled at a first resolution (e.g., 1080 x 1440) and two or more depth map segments multiplexed together to form a depth map part. In an example, the two depth map segments of a 540 x 960 input depth map (e.g., 540 x 480 ZLA and 540 x 480 ZLB) may be stacked vertically.
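A minimal sketch of the segmentation and stacking step, using the 540 x 960 depth map from the example above, is shown below; the even split and the array layout are illustrative assumptions:

```python
import numpy as np

def segment_and_stack_depth(depth_540x960):
    """Split a 540x960 depth map into two horizontally contiguous halves
    (ZLA, ZLB), each 540x480, and stack them vertically into a 1080x480
    depth-map part, as in FIG. 15C."""
    h, w = depth_540x960.shape
    zla, zlb = depth_540x960[:, : w // 2], depth_540x960[:, w // 2:]
    return np.concatenate([zla, zlb], axis=0)

zl = np.random.randint(0, 256, (540, 960), dtype=np.uint8)
depth_part = segment_and_stack_depth(zl)
print(depth_part.shape)  # (1080, 480)
# The 1080x1440 scaled left view and this 1080x480 depth part together
# fill the 1080x1920 luminance plane 1512C-YS.
```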
In some embodiments, a depth map may be segmented into more than two segments. In some embodiments, a depth map may be segmented across the vertical direction. In some embodiments, a depth map may be segmented across both the vertical and horizontal directions. In some embodiments, the depth map may be segmented into unequal segments. In some embodiments, the segments may be stacked horizontally, vertically, or both vertically and horizontally.
In some embodiments, one or more of the segmented depth maps may be flipped horizontally or vertically before being stored as part of the multiplexed image. Experiments have shown that such flipping reduces the coding artifacts at the borders between the texture part and the depth parts of the coded multiplexed image (e.g., 1512C-YS). Furthermore, there are fewer coding artifacts at the center of the split depth-map image.
In an example embodiment, let d[i,j] denote pixel values of a segment of a depth map (e.g., ZLB). Let Dw denote the width of this segment. If the pixel values of this segment are flipped across the left vertical axis, then, for the i-th row, the pixel values of the horizontally flipped segment (d_hf[i,j]) may be determined as:
for (j = 0; j < Dw; j++)
    d_hf[i,j] = d[i, Dw-1-j];
A decoder receiving an image with segmented depth maps (e.g., 1512C-YS) may use metadata to properly align all the decoded depth map segments to reconstruct the original depth map (e.g., ZL), and thus re-generate a proper 3D output image. Any flipped depth-map segments will need to be flipped back to their original orientation before being used for rendering the final output.
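A minimal decoder-side sketch is shown below; it assumes the two-segment vertical stacking of FIG. 15C and a single hypothetical metadata flag indicating that the second segment was flipped horizontally, since the exact metadata syntax is not specified here:

```python
import numpy as np

def reassemble_depth(depth_part, second_segment_flipped=False):
    """Undo the stacking of FIG. 15C: split the received depth-map part into
    its two segments, un-flip any flipped segment, and re-join them side by
    side to recover the original depth map (e.g., ZL)."""
    h = depth_part.shape[0] // 2
    zla, zlb = depth_part[:h, :], depth_part[h:, :]
    if second_segment_flipped:
        zlb = zlb[:, ::-1]  # flip back across the vertical axis
    return np.concatenate([zla, zlb], axis=1)

depth_part = np.random.randint(0, 256, (1080, 480), dtype=np.uint8)
print(reassemble_depth(depth_part, second_segment_flipped=True).shape)  # (540, 960)
```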
In some embodiments, asymmetric spatial multiplexing and depth map segmentation may also be applied to depth delivery formats that include both image views of the input image (e.g., FIG. 12A and FIG. 12B ).
In an embodiment, the RPUZ process of 1620 can be eliminated. An encoder may simply use a constant flat gray value to predict ZL depth data during the coding process of the EL-2 1610 layer (e.g., all pixel values of the predictor may be set equal to 128 for 8-bit pictures).
As depicted in FIG. 17A , in an embodiment, the base layer (BL) comprises two parts: a side-by-side (e.g., 1920x1080) multiplexed picture (112) and a subsampled version of depth data for either the left view or the right view (e.g., 1920x8 ZL' 1710). Because depth data have no chroma information, in an embodiment, chroma-related data for the extra padding rows of the BL signal (1735) may be simply set to a constant value (e.g., 128).
In an embodiment, signal ZL' 1710 may be created as follows. Let ZL denote a high-resolution left-view depth data signal (e.g., 960x540). This signal may be filtered and sub-sampled both horizontally and vertically to generate a sub-sampled version that can fit within the resolution of the padding data (e.g., 1920x8). For example, given a 960x540 signal one may generate a 240x60 signal ZL'. Then one can pack the 240∗60=14,400 ZL' bytes into the available space of 1920∗8=15,360 bytes using any suitable packing scheme.
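One possible packing scheme is sketched below; raster-scan packing with a constant fill value for the unused samples is an assumption, since the document leaves the packing scheme open:

```python
import numpy as np

def pack_depth_into_padding(zl_sub, pad_rows=8, width=1920, fill=128):
    """Raster-scan pack a 240x60 sub-sampled depth map (14,400 samples)
    into the 1920x8 padding area (15,360 samples); unused samples are
    set to a constant fill value."""
    padding = np.full(pad_rows * width, fill, dtype=np.uint8)
    samples = zl_sub.ravel()
    padding[: samples.size] = samples
    return padding.reshape(pad_rows, width)

def unpack_depth_from_padding(padding, depth_h=60, depth_w=240):
    """Recover the sub-sampled depth map from the padding rows."""
    return padding.ravel()[: depth_h * depth_w].reshape(depth_h, depth_w)

zl_sub = np.random.randint(0, 256, (60, 240), dtype=np.uint8)  # 240x60 (w x h)
padding = pack_depth_into_padding(zl_sub)
assert np.array_equal(unpack_depth_from_padding(padding), zl_sub)
```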
As depicted in FIG. 17A , in an embodiment, the enhancement layer (EL) comprises top-and-bottom luminance data (117-Y), lower resolution left view or right view depth data (e.g., ZR' 1715), and high-resolution left-view and right-view depth data (1745-U and 1745-V). For example, in the luminance signal, ZR' 1715 may comprise a 240x60 sub-sampled version of the original ZR depth data, packed into the 1920x8 padding area. For chroma (1745), instead of transmitting the chroma of the top-and-bottom signal (117), one may transmit high resolution ZR and ZL depth data. In an embodiment, instead of transmitting the U (or Cb) chroma data, one may transmit the even columns of ZR and ZL (ZR-e, ZL-e 1745-U), and instead of transmitting the V (or Cr) data of 117 one may transmit the odd columns of ZR and ZL (ZR-o, ZL-o 1745-V). As in the BL, ZR' data 1715 have no chroma information, hence their corresponding chroma data (1740) may be set to a fixed value (e.g., 128).
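A minimal sketch of this even/odd column split is shown below; the side-by-side tiling of the ZR and ZL halves within each chroma plane is an assumption, as the exact layout of 1745-U and 1745-V is not detailed here:

```python
import numpy as np

def split_depth_to_chroma(zr, zl):
    """Split two high-resolution depth maps into even and odd columns and
    tile them for the chroma planes: U carries (ZR-e | ZL-e), V carries
    (ZR-o | ZL-o). The side-by-side tiling within each plane is an assumption."""
    u_plane = np.concatenate([zr[:, 0::2], zl[:, 0::2]], axis=1)
    v_plane = np.concatenate([zr[:, 1::2], zl[:, 1::2]], axis=1)
    return u_plane, v_plane

def merge_depth_from_chroma(u_plane, v_plane):
    """Decoder side: re-interleave even and odd columns to recover ZR and ZL."""
    half = u_plane.shape[1] // 2
    zr = np.empty((u_plane.shape[0], half * 2), dtype=u_plane.dtype)
    zl = np.empty_like(zr)
    zr[:, 0::2], zr[:, 1::2] = u_plane[:, :half], v_plane[:, :half]
    zl[:, 0::2], zl[:, 1::2] = u_plane[:, half:], v_plane[:, half:]
    return zr, zl

zr = np.random.randint(0, 256, (540, 960), dtype=np.uint8)
zl = np.random.randint(0, 256, (540, 960), dtype=np.uint8)
u, v = split_depth_to_chroma(zr, zl)
zr2, zl2 = merge_depth_from_chroma(u, v)
assert np.array_equal(zr2, zr) and np.array_equal(zl2, zl)
```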
Because of the inter-layer prediction requirements and the sequential nature of coding and decoding macroblock data, in practice, at least depth data ZR' (1715) for a frame coded at time t may actually represent depth data for a previously coded frame, say at time t-1 or earlier. This delay may be necessary to allow RPUz 1730 to fully reconstruct all depth data (e.g., ZR') needed to code (or decode) ZL and ZR in the enhancement layer (1765). For example, during encoding, at time T0, the EL(T0) frame may comprise dummy ZR' data (e.g., all values are set equal to 128). Then, the EL(T1) frame may comprise depth data of the T0 frame, the EL(T2) frame may comprise depth data of the T1 frame, and so forth. During decoding, the dummy depth data of the first decoded frame will be ignored and depth data will be recovered with at least a one-frame delay.
The luminance of EL can be encoded on its own using a second EL encoder or, as depicted in FIG. 17A , it can be encoded using texture RPUT 1725 with reference to the base layer. A depth map RPUz (1730) may also be used so that the high-resolution depth data in the "chroma" space of EL may be coded by taking into consideration the sub-sampled ZL' (1710) and ZR' data (1715). For example, in an embodiment, RPUz (1730) may comprise a simple up-sampler.
Given the bit streams depicted in FIG. 17A , a single decoder can decode the BL stream and extract a frame compatible (FC) 3D stream plus sub-sampled depth data for one of the views. A dual-layer (e.g., MVC) decoder may decode an FCFR 3D stream plus ZL and ZR depth data.
In another embodiment, the EL streams as depicted in FIG. 17A or FIG. 17B may include depth data in only parts of the EL-U (1745-U or 1765-U) or EL-V (1745-V or 1765-V) regions. For example, the ZR-o, ZL-o 1745-V streams or the RZR-o, RZL-o 1765-V streams may be replaced by a constant value (e.g., 128). This approach reduces the bit rate requirements at the expense of lower depth map resolution.
Another approach to reduce bit rate requirements comprises transmitting depth map data for only one view (say, ZR). In such a scenario, all data for the other view depth region (say, ZL) may be filled with a constant value (e.g., 128). Alternatively, one may transmit depth map data for a single view (say, ZR) at double the resolution than before. For example, in an embodiment, ZL-o and ZL-e depth data may be replaced by additional ZR data.
Both FIG. 17A and FIG. 17B depict embodiments where the base layer comprises a side-by-side stream and the enhancement layer comprises a top-and-bottom stream. In other embodiments, the same process may be applied to a system where the BL comprises a top-and-bottom stream and the EL comprises the side-by-side stream.
Embodiments of the present invention may be implemented with a computer system, systems configured in electronic circuitry and components, an integrated circuit (IC) device such as a microcontroller, a field programmable gate array (FPGA), or another configurable or programmable logic device (PLD), a discrete time or digital signal processor (DSP), an application specific IC (ASIC), and/or apparatus that includes one or more of such systems, devices or components. The computer and/or IC may perform, control or execute instructions relating to encoding and decoding depth map delivery formats, such as those described herein. The computer and/or IC may compute any of a variety of parameters or values that relate to encoding and decoding depth map delivery formats as described herein. The depth map delivery embodiments may be implemented in hardware, software, firmware and various combinations thereof.
Certain implementations of the invention comprise computer processors which execute software instructions which cause the processors to perform a method of the invention. For example, one or more processors in a display, an encoder, a set top box, a transcoder or the like may implement methods for encoding and decoding depth map delivery formats as described above by executing software instructions in a program memory accessible to the processors. The invention may also be provided in the form of a program product. The program product may comprise any medium which carries a set of computer-readable signals comprising instructions which, when executed by a data processor, cause the data processor to execute a method of the invention. Program products according to the invention may be in any of a wide variety of forms. The program product may comprise, for example, physical media such as magnetic data storage media including floppy diskettes, hard disk drives, optical data storage media including CD ROMs, DVDs, electronic data storage media including ROMs, flash RAM, or the like. The computer-readable signals on the program product may optionally be compressed or encrypted.
Claims (9)
- A method for delivering 3D depth map data, the method comprising: accessing an input picture comprising a first view and a second view, wherein each view has a horizontal and a vertical pixel resolution; generating a side-by-side picture and a top-and-bottom picture based on the input picture, wherein the side-by-side picture and top-and-bottom picture comprise half-resolution data of the first and second views; encoding using an encoder one of the side-by-side picture and the top-and-bottom picture to generate a coded base layer stream (212); encoding, using the encoder and a texture reference processing unit (225), the other one of the side-by-side picture and the top-and-bottom picture to generate a coded first enhancement layer (EL-1, 217), wherein the coded first enhancement layer (EL-1) is partially coded based on the coded base layer stream interpolated by the texture reference processing unit (225) and partially on said other one of the side-by-side picture and the top-and-bottom picture; and characterized in that the method further comprises accessing input depth map data for the input picture; encoding, using the encoder and a depth-map reference processing unit (230), sub-sampled depth map data for said one of the side-by-side picture and the top-and-bottom picture to generate a coded second enhancement layer (EL-2), wherein the coded second enhancement layer (EL-2) is partially coded based on the base layer (BL) and partially coded based on the sub-sampled depth map data, wherein the depth-map reference processing unit (230) is configured to estimate sub-sampled depth map data to be used by the second enhancement layer based on estimated depth data from the coded base layer stream (212), the sub-sampled depth map data for said one of the side-by-side picture and the top-and-bottom picture being generated based on the input depth data of the input picture, and wherein the coded second enhancement layer (EL-2) is representative of sub-sampled depth map data encoded as a primary channel (219S-A; 219T-A) and secondary channels (219S-B; 219T-B), wherein the method comprises incorporating depth information missing from sub-sampled depth map data (ZL, ZR) for said one of the side-by-side picture and the top-and-bottom picture of the primary channel (219S-A; 219T-A) into the secondary channels (219S-B; 219T-B).
- The method of claim 1, wherein the coded second enhancement layer (EL-2) carries the difference between the accessed input depth map data and the estimated depth map data.
- The method of claim 1, comprising generating a side-by-side picture based on horizontal sub-sampling of the first view and second view and generating a top-and-bottom picture based on vertical sub-sampling of the first view and second view of the input picture; and/or multiplexing the coded base layer, the coded first enhancement layer, and the coded second enhancement layer into an output coded bitstream.
- The method of claim 1, wherein the depth-map reference processing unit generates an estimate of a first view depth map and an estimate of a second view depth map based on the input picture.
- The method of claim 1, wherein the depth-map reference processing unit (230) estimates depth map data based on the input picture and wherein the coded second enhancement layer (EL-2) preferably carries the difference between the accessed input depth map data and the estimated depth map data.
- The method of claim 1, further comprising: generating a first half picture having half the horizontal pixel resolution and the same vertical pixel resolution as the first view of the input picture; generating a second half picture having half the horizontal pixel resolution and the same vertical pixel resolution as the second view of the input picture; multiplexing the first half picture and the second half picture to generate the side-by-side picture.
- The method of claim 1, further comprising: generating a third half picture having half the vertical pixel resolution and the same horizontal pixel resolution as the first view of the input picture; generating a fourth half picture having half the vertical pixel resolution and the same horizontal pixel resolution as the second view of the input picture; multiplexing the third half picture and the fourth half picture to generate the top-and-bottom picture.
- A data processing apparatus comprising a processor and configured to perform any one of the methods recited in claims 1-7.
- A computer program product having computer-executable instructions, which when executed by a computer, cause the computer to carry out a method in accordance with any of the claims 1-7.
Applications Claiming Priority (7)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201261659588P | 2012-06-14 | ||
| US201261712131P | 2012-10-10 | ||
| US201261739886P | 2012-12-20 | ||
| US201361767416P | 2013-02-21 | ||
| US201361807013P | 2013-04-01 | ||
| US201361807668P | 2013-04-02 | ||
| US201361822060P | 2013-05-10 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1261742A1 (en) | 2020-01-03 |
| HK1261742B (en) | 2020-09-11 |