WO2009091563A1

WO2009091563A1 - Depth-image-based rendering

Info

Publication number: WO2009091563A1
Application number: PCT/US2009/000245
Authority: WO
Inventors: Yu Huang; Joan Llach
Original assignee: Thomson Licensing SAS
Current assignee: Thomson Licensing SAS
Priority date: 2008-01-18
Filing date: 2009-01-15
Publication date: 2009-07-23
Anticipated expiration: 2010-07-18

Abstract

Various implementations are described. Several implementations relate to depth-image-based rendering. Many of these implementations use temporal information in synthesizing an image. For example, temporal information may be used to generate a background layer for a warped image, and then the background layer may be blended with the main layer. One method includes accessing information from a reference image (1005). The reference image is for a reference view at a particular time. Information from a second image is accessed (1010). The second image is for a different time than the particular time. An additional image is created based on the information from the reference image and on the information from the second image (1015). The additional image is for an additional view that is different from the reference view and being for the particular time.

Description

,

DEPTH-IMAGE-BASED RENDERING

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application Serial No. 61/011,519, filed on January 18, 2008, titled "Depth-Image-Based Rendering", the contents of which are hereby incorporated by reference in their entirety for all purposes.

TECHNICAL FIELD Implementations are described that relate, for example, to coding and decoding systems and apparatus including the same. Particular implementations relate to depth-image-based rendering.

BACKGROUND Some three dimensional applications create an intermediary view by interpolating between two views, or simply extending a single view. However, background objects (holes) can be uncovered when creating the intermediary view and typically information is unavailable for such objects, thus presenting a problem relating to how such objects should be treated in order to obtain an accurate representation of the same. The creation of these holes is referred to as the dis-occlusion problem. Moreover, there are other problems. For example, the occlusion problem described herein below, as well as artifacts created at the boundary between objects at different depths during the warping process are other problems that may be addressed.

SUMMARY

According to a general aspect, information from a reference image is accessed. The reference image is for a reference view at a particular time. Information from a second image is accessed. The second image is for a different time than the particular time. An additional image is created based on the information from the reference image and on the information from the second image. The additional image is for an additional view that is different from the reference view and is for the particular time. The details of one or more implementations are set forth in the accompanying drawings and the description below. Even if described in one particular manner, it should be clear that implementations may be configured or embodied in various manners. For example, an implementation may be performed as a method, or embodied as apparatus, such as, for example, an apparatus configured to perform a set of operations or an apparatus storing instructions for performing a set of operations, or embodied in a signal. Other aspects and features will become apparent from the following detailed description considered in conjunction with the accompanying drawings and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an implementation of an encoder for encoding image data for a view obtained using depth-image-based rendering.

FIG. 2 is a block diagram of an implementation of a decoder for decoding image data for a view obtained using depth-image-based rendering.

FIG. 3 is a block diagram of an implementation of an apparatus for encoding and transmitting image data for a view obtained using depth-image-based rendering.

FIG. 4 is a block diagram of an implementation of an apparatus for demodulating and decoding image data for a view obtained using depth-image-based rendering.

FIG. 5 is a diagram of an implementation of a pixel-based boundary layer construction method.

FIG. 6 is a diagram of an implementation of a splatting technique with respect to boundary layer construction. FIG. 7 is a diagram of an implementation of a triangle-based boundary layer construction method.

FIG. 8 is a diagram of an implementation of a mesh warping technique with respect to boundary layer construction.

FIG. 9 is a diagram of an implementation of multiple (two) reference views based rendering.

FIG. 10 is a flow diagram of an implementation of a method for encoding and transmitting image data for a view obtained using depth-image-based rendering.

FIG. 1 1 is a flow diagram further illustrating step 1015 of method 1000 of FIG. 10. FIG. 12 is a flow diagram of an implementation of a method for demodulating and decoding image data for a view obtained using depth-image-based rendering.

DETAILED DESCRIPTION Image-based rendering (IBR) combines both computer vision and computer graphics technologies to generate a novel view using a collection of images from different viewpoints. In the past decade, IBR has received much attention as a powerful alternative to the traditional geometry-based rendering for view synthesis. Applications such as video games, virtual travel, multi-view video coding (MVC), three dimensional (3D) television, and free viewpoint video (FW) stand to benefit from this technology.

Depth-image-based rendering (DIBR) is a technique of view synthesis that uses a number of images captured from multiple calibrated cameras as well as associated per-pixel depth information. The per-pixel depth information may be computed using, for example, stereo vision. In the rendering process, various methods may be used to deal with the occlusion and dis-occlusion problems. As used herein, the occlusion problem, also interchangeably referred to as the visibility problem, refers to the situation when multiple pixels are mapped to the same location in the synthesized view. With respect to the occlusion problem, an image portion (e.g., one or more pixels, one or more image blocks, and so forth) is not visible in the new view obtained by warping, although the image portion was visible prior to the warping. Moreover, as used herein, the dis-occlusion problem, also interchangeably referred to herein as the exposure problem, refers to the situation when previously invisible scene points are uncovered in the synthesized view, producing what are commonly referred to as holes. With respect to the dis-occlusion problem, an image portion (e.g., one or more pixels, one or more image blocks, and so forth) is visible (although likely represented as a "hole") in the new view obtained by warping, although the image portion was not visible prior to the warping.

One technique for dealing with the dis-occlusion problem includes creating a boundary layer around the hole. This technique determines which pixel in the boundary layer has the greatest depth, and copies this pixel to the hole based on the assumed rationale that, the pixel is in the background and odds are that the hole is in the background. However, the copied pixel might not be in the background. Also, _Λ

even if the copied pixel is in the background, the background might not be a solid color.

A second technique, discussed in at least one implementation in this application, proposes a layered method to resolve the visibility problem in depth-image-based rendering. In at least one such implementation, for each reference view, we use a novel three-layer representation, that is, the main layer, the background layer and the boundary layer. As used herein, the phrases "boundary" and "boundary layer" generally refer to an edge which results from depth discontinuities. Based on the rendering algorithm which may involve, but is not limited to, for example, pixel-based (splatting) or triangular mesh-based, we design an associated method to generate the boundary layer in a spatio-temporal manner. We build a temporal background model for each frame by searching backward and forward for uncovered background information in other frames in the same reference video, based on depth variance. Three dimensional image warping can be used to realize DIBR. Three dimensional image warping is well known to one of ordinary skill in the art. Three dimensional warping generates a novel image from any nearby viewpoints by un-projecting pixels of reference images from the proper three dimensional locations and re-projecting them onto the new image space. After three dimensional warping, the determination of colors per pixel in the synthesized view is typically the classical computer graphics problem of reconstruction and re-sampling. Generally, the rendering method can be pixel-based (splatting), or mesh-based (triangular). Either method is capable of dealing with the occlusion and dis-occlusion problem. As noted above, the occlusion (visibility) problem refers to the case when multiple pixels are mapped to the same location in the synthesized view. One solution to the occlusion problem is Z-buffering. An alternative method is mapping the pixels in a specific order referred to as back-to-front occlusion compatible. One short-coming of the alternate method is unavailability for rendering with multiple reference images, since a mapping order cannot be found for multiple reference views or sources.

As noted above, dis-occlusion (exposure) occurs when the previously invisible scene points are uncovered in the synthesized view, producing what are commonly referred to as holes. Since the reference view does not provide information about this portion, a view synthesis system may assume the background typically extends into the hole. This simplistic approach would examine the depth of all the pixels bordering the hole, and copy the pixel that is the farthest away to each exposed pixel. This method is generally inefficient and not appropriate for textured backgrounds.

In the reference view, depth discontinuities at the boundary between the foreground and background can be considered to cause the holes. The depth discontinuities may be located, and a boundary strip may be created around these depth-discontinuity pixels. As used herein, a boundary strip refers to a narrow (one or more pixels wide) strip between the foreground and the background in a particular picture or portion of a picture. A Bayesian matting may be used to determine the color and depth within these strips. While rendering the synthesized view, both the main layer and the boundary layer may be blended together to remove cracks and artifacts.

In video-based rendering, temporal artifacts could be visible when the hole-filling method in IBR is applied for each frame. Meanwhile, the occluded region in some frame could be uncovered at other frames of the video captured from the same view, since the foreground may disappear while the background may appear in the same location. In view synthesis from a pair of rectified images for one-to-one conferencing, a method may be used for temporal maintenance of a background model that helps fill in holes and reduce temporal artifacts. The method segments the unique foreground from the background first by a bi-modal histogram thresholding, then updates the background model with the newly discovered pixels of the background. Although the bi-modal histogram looks characteristic of the relatively simple scene of a talking head, the bi-modal histogram will face more challenges when applied to a complicated background. Moreover, in general, range data segmentation is difficult. Further, another challenging problem is background maintenance although the camera is assumed fixed. Several issues in real scenarios include, for example, illuminance change (lighting conditions), small moving objects on the background (e.g., moving curtain or shaking tree leaves), sleeping object (moving into the background and then motionless), waking object (moving away from the background), foreground objects' shadows, and so forth. FIG. 1 shows a non-limiting block diagram of an implementation of an encoder

100 for encoding image data for a view obtained using depth-image-based rendering. The encoder 100 includes a view multiplexer 105 having an output connected in signal communication with a non-inverting input of a combiner 110 and a first input of a motion estimator 130. An output of the combiner 110 is connected in signal communication with an input of a transformer 115. An output of the transformer 115 is connected in signal communication with an input of a quantizer 120. An output of the quantizer 120 is connected in signal communication with an input of an entropy coder 125 and an input of an inverse quantizer 140. An output of the inverse quantizer 140 is connected in signal communication with an input of an inverse transformer 145. An output of the inverse transformer 145 is connected in signal communication with a first non-inverting input of a combiner 150. An output of the combiner 150 is connected in signal communication with an input of an intra predictor 164 and with an input of a deblocking filter 152. An output of the deblocking filter is connected in signal communication with a first input of an image warper 155 and an input of a reference view portion 170 of a decoder picture buffer 177. An output of the image warper 155 is connected in signal communication with an input of a synthesized view portion 165 of the decoder picture buffer 177. An output of the reference view portion 170 and an output of the synthesized view portion 165 are connected in signal communication with a second input of the motion estimator 130 and a first input of a motion compensator 135. An output of the motion estimator 130 is connected in signal communication with a second input of the motion compensator 135. An output of the motion compensator 135 is connected in signal communication with a first input of an inter/intra or synthesis mode selector 166. An output of the inter/intra or synthesis mode selector 166 is connected in signal communication with an inverting input of the combiner 110 and a second non-inverting input of the combiner 150. An output of the intra predictor 164 is connected in signal communication with a second input of the inter/intra or synthesis mode selector 166. Inputs of the view multiplexer 105 are available as inputs to the encoder 100, for receiving picture data for views 0 through N. A second input and third input of the image warper 155 are available as inputs of the encoder 100, for receiving camera parameters and depth values. An output of the entropy coder 125 is available as an output of the encoder 100, for outputting a bitstream corresponding to the multi-view picture data.

The image warper 155 takes the last encoded image (for a particular view) and creates warped images for one or more views other than the particular view. The warped images are stored in the decoder picture buffer 177 and will be used as reference for the encoding of future images.

The decoder picture buffer 177 includes all the reference images available for encoding future images. The reference view portion 170 includes previously decoded -

images, and the synthesized view portion 165 includes the set of warped images created from the previously decoded images.

The mode selector 166 selects the best prediction mode to be used for the encoding. Besides the two modes available in standard video encoders (inter and intra modes), the modified mode selector 166 can also choose a synthesis mode which uses a synthesized image for the inter prediction.

The remaining elements in FIG. 1 essentially operate as in any standard MPEG-4 AVC encoder.

FIG. 2 shows a non-limiting block diagram of an implementation of a decoder 200 for decoding image data for a view obtained using depth-image-based rendering. The decoder 200 includes an entropy decoder 205 having an output connected in signal communication with an input of an inverse quantizer 210. An output of the inverse quantizer 210 is connected in signal communication with an input of an inverse transformer 215. An output of the inverse transformer 215 is connected in signal communication with a first non-inverting input of a combiner 220. An output of the combiner 220 is connected in signal communication with an input of a deblocking filter 225 and an input of an intra predictor 235. An output of the deblocking filter is connected in signal communication with an input of a picture buffer 240. An output of the picture buffer 240 is connected in signal communication with a first input of a motion compensator 260 and a first input of an image warper 250. An output of the image warper 250 is also connected in signal communication with the first input of the motion compensator 260. An output of the motion compensator 260 is connected in signal communication with a first input of an intra/inter or synthesis mode selector 230. An output of an intra predictor is connected in signal communication with a second input of the intra/inter or synthesis mode selector 230. An output of the intra/inter or synthesis mode selector 230 is connected in signal communication with a second non-inverting input of the combiner 220. An input of the entropy decoder 205 is available as an input of the decoder 200, for receiving a bitstream. A second input of the motion compensator 260 is available as an input of the decoder 200, for receiving motion vectors. A second input of the image warper 250 is available as an input of the decoder 200, for receiving camera parameters. A third input of the image warper 250 is available as an input of the decoder 200, for receiving depth values. An output of the deblocking filter 225 is available as an output of the decoder 200, for outputting pictures. o

The intra/inter or synthesis mode selector 230 selects the prediction mode to be used for the decoding based on the information present on the received bit stream. Besides the two modes available in standard video decoders (intra and inter modes), the modified mode selector 230 can also choose a synthesis mode which uses a synthesized image for the inter prediction.

The image warper 250 creates a synthesized image from one of the decoded pictures stored in the decoded picture buffer 240 when such synthesized image is required by the mode selector 230. The parameters required to perform the image synthesis are obtained from the received bit stream. The remaining elements in FIG. 2 operate as in any standard MPEG-4 AVC decoder.

FIG. 3 shows a non-limiting block diagram of an implementation of an apparatus 300 for encoding and transmitting image data for a view obtained using depth-image-based rendering. The apparatus 300 includes a rendering unit 305 having an output connected in signal communication with an input of an encoder 310. An output of the encoder 310 is connected in signal communication with an input of a transmitter 315. An input of the rendering unit 305 is available as an input of the apparatus 300, for receiving a reference image and a second image. An output of the transmitter 315 is available as an output of the apparatus 300, for outputting encoded images for transmission, for example, over one or more networks.

In an embodiment, the rendering unit 305 is configured to access information from a reference image and a second image. The reference image is for a reference view at a particular time, and the second image is for a different time than the particular time. The rendering unit 305 is also configured to create an additional image based on the information from the reference image and on the information from the second image. The additional image is for an additional view that is different from the reference view and being for the particular time. The encoder 310 is configured to encode the reference image, the second image, and the additional image. The transmitter 315 is configured to transmit the encoded reference image, the encoded second image, and the encoded additional image.

In an embodiment, the rendering unit 305 includes a memory interface 306 and a synthesizer 307. In an embodiment, the memory interface 306 may be configured to access the information from the reference image and second image. The memory _g

interface 306 may also be configured to create the additional image based on the information from the reference image and on the information from the second image. The transmitter 315 may be, for example, adapted to transmit a program signal having one or more bitstreams representing encoded pictures and/or information related thereto. Typical transmitters perform functions such as, for example, one or more of providing error-correction coding, interleaving the data in the signal, randomizing the energy in the signal, and modulating the signal onto one or more carriers. The transmitter may include, or interface with, an antenna (not shown).

FIG. 4 shows a non-limiting block diagram of an implementation of a decoder 400 for demodulating and decoding image data for a view obtained using depth-image-based rendering. The apparatus 400 includes a demodulator 405 having an output connected in signal communication with an input of a decoder 410. An output of the decoder 410 is connected in signal communication with an input of a rendering unit 415. An output of the rendering unit is connected in signal communication with an input of a presentation device 420. An input of the demodulator 405 is available as an input to the apparatus 400, for receiving a signal including an encoded reference image and an encoded second image. An output of the presentation device 420 is available as an output of the apparatus 400, for displaying any of the reference image, the second image, and an additional image. In an embodiment, the demodulator 405 is configured to receive and demodulate a signal. The signal includes an encoded reference image and an encoded second image. The reference image is for a reference view at a particular time. The second image is for a different time than the particular time. The decoder 410 is configured to decode the encoded reference image and the encoded second image. The rendering unit 415 is configured to access information from the decoded reference image, to access information from the decoded second image, and to create an additional image based on the information from the decoded reference image and on the information from the decoded second image. The additional image is for an additional view that is different from the reference view and is for the particular time. The presentation device 420 is configured to display the additional image.

The demodulator 405 may be, for example, adapted to receive a program signal having a plurality of bitstreams representing encoded pictures. Typical receivers perform functions such as, for example, one or more of receiving a modulated and encoded data signal, demodulating the data signal from one or more _{1 Q}

carriers, de-randomizing the energy in the signal, de-interleaving the data in the signal, and error-correction decoding the signal. The functions of decoder 410 may also be performed by the demodulator 405 in various implementations. The demodulator 405 may include, or interface with, an antenna (not shown). It is to be appreciated that apparatus 300, apparatus 400, and/or other implementations of the present principles may be implemented in a set top box, a transmitter, mobile phones, personal digital assistants (PDAs), mobile computers, and so forth.

As another example, apparatus 300 may represent all or part of a video transmission system. The video transmission system may be, for example, a head-end or transmission system for transmitting a signal using one or more of a variety of media, such as, for example, satellite, cable, telephone-line, or terrestrial broadcast. The transmission may be provided over the Internet or some other network. As yet another example, apparatus 400 may represent all or part of a video receiving system. The video receiving system may be, for example, a cell-phone, a computer, a set-top box, a television, or other device that receives encoded video and provides, for example, decoded video for display to a user or for storage. Thus, the video receiving system may provide its output to, for example, a screen of a television, a computer monitor, a computer (for storage, processing, or display), or some other storage, processing, or display device. The video receiving system may be configured, for example, to receive signals over one or more of a variety of media, such as, for example, satellite, cable, telephone-line, or terrestrial broadcast. The signals may be received over the Internet or some other network.

1. 3-D Warping

One technical route to realize DIBR is via the three dimensional image warping that is known in computer graphics literatures. If we define a three dimensional point by its homogenous coordinates p - (x,y,z,\)' , and its perspective projection in the reference image plane by P_r = (u_r, v_Λ,l)' , then we have a general perspective projection defined as follows:

w_rP_r = PPM_rP , (1) _{χ }}

where w_r is the depth factor, and PPM _r is the 3x4 perspective projection matrix built by extrinisic and intrinsic parameters of the calibrated reference camera. Correspondingly, we get the equation for the synthesized view as follows:

w_sP^ PPM,_{P <} (2)

where P₅ is also a homogeneous coordinate as defined above, and w_s is a depth scaling factor. We denote the twelve elements of PPM _r as q_tJ , / =1,2,3, j=1,2,3,4. From the image point P_r and its depth z, we can estimate the other two components of the three dimensional point p by a linear equation as follows:

using the following equalities:

b_\ = føi₄ - ^M _r?₃₄) + føπ ~ u_rq^)z , O_{1 1} = u_rq_3l - q_n , a_n = u_rq_i2 - q_u . b₂ = (?₂₄ - V^₃₄) + (^₂₃ -V^₃₃)Z , α₂₁ = v_rq_{2 l} - g_2] , a₂₂ = v_rq₃₂ - q₂₂ .

Re-projecting the three dimensional point onto the synthesized image plane using Equation (2), we obtain the novel view's image P₁. In re-projecting, we map the three dimensional point onto a two dimensional point in the image plane.

2. Boundary Layer In the reference view, depth discontinuities at the boundary between the foreground and background cause the holes in the synthesized view. Since the pixels along the boundary of the objects receive contributions from the foreground and the background colors, these mixed color pixels will result in visible artifacts in depth-image-based rendering. Boundary matting is a technique to reduce the artifacts caused by mixed pixels. Boundary matting and the generation of a boundary layer are well-known to one of ordinary skill in the art. , ,

12

In at least one implementation, we use the boundary layer mainly for filling in holes. First we locate the depth discontinuities by checking if the disparity jump between each neighboring pixel pair is greater than ξ pixels, denoted with a boolean function dpbound (x,y). Note that a disparity image is typically generated in three dimensional warping, and the disparity jumps can be determined based on the disparity image. The threshold, ξ , can be selected based on the scene and intensity range of the disparity image. In some implementations, the range is 0-255 and the threshold is selected as 5 pixels. Based on the associated rendering method, for example, splatting or mesh warping, herein below we discuss the procedure to form the boundary layer.

Splatting

Splatting is a well-known technique. FIG. 5 shows a non-limiting diagram of an implementation of a pixel-based boundary layer construction method 500, also interchangeably referred to herein as Algorithm 1. That is, Algorithm 1 is the pixel-based process used to label the boundary layer pixels and determine their color and disparity values based on the background extension. Algorithm 1 checks pixel d's disparity value disp{d) with its 8 pixel neighborhood. FIG. 6 illustrates pixel d and its 8 pixel neighborhood. That is, FIG. 6 shows a non-limiting diagram of an implementation of a splatting technique 600 with respect to boundary layer construction. The modification formula for the pair of pixels at the depth discontinuity is shown with respect to FIG. 6. If a depth jump is found between pixel d and pixel e₄, where d is the foreground and e₄ is the background, then we modify d and e₄ from U as follows:

_* val {θ4 ) = α ^■ val{f₄ ) + (1 - α) ■ val{e^ ) , vat*{d) = β- val^*{e₄) + (1 - β) ■ val(d) ,

where α,β is a constant factor (whose value is preferably, but not mandorily as follows: 0.5<α </?<1) and val(.) denotes the color or depth information. Note that in general we are extending the background into the boundary layer. We erode the boundary layer obtained by Algorithm 1 by one pixel to prevent cracks from appearing in the rendering. Mesh warping

Mesh warping is well-known, particularly in the area of computer graphics. Since, in at least one implementation, we use the triangular mesh as the basic primitive, we check the disparity jump in another way. FIG. 7 shows a non-limiting diagram of an implementation of triangle-based boundary layer construction method 700, also interchangeably referred to herein as Algorithm 2. That is, Algorithm 2 is the triangle-based process used to split each section into two triangles, label the boundary layer pixels, and determine their color and disparity value based on background extension. Thus, Algorithm 2 provides boundary layer pixels using mesh warping. FIG. 8 shows a non-limiting diagram of an implementation of a mesh warping technique 800 with respect to boundary layer construction. That is, FIG. 8 illustrates how we check each 2x2 section and its neighborhood with respect to the boundary layer construction. The modification of a triangle at depth discontunities is similar to the pixel operation. As shown with respect to FIG. 8, if there is depth jump in triangle Cf₁Cf₂Cf₃, where c/i is the foreground and d₂d₃ is the background, then we modify triangle cficf₂cf3 from triangle d^^cz as follows:

val \d₂) = a - val(c_g ) + (1 - a) ■ val(d₂ ) , val' (d₃) = a - vα/(c, ,) + (. - «) ^• val(d₃ ) , val' {d_λ) = β- val{d \ ) + (1 - β) ^■ val{d, ) .

In the triangle-based boundary layer, we do not run erosion like the pixel-based one because the latter is in fact one pixel wider. It is shown in Algorithms 1 and 2 that we determine the pixels' color in the boundary layer by extending the background, either pixel-based or triangle-based. In Algorithms 1 and 2, we repeatedly compare disparity along varying directions. If the pixel of the boundary layer is touched by multiple extensions from different directions, then its ultimate color and depth are determined by the one with the largest depth. The largest depth is most likely to be the background.

Furthermore, for each pixel in the boundary region, we exploit its information in the temporal dimension. That is, we search forward and backward, starting from the closest (in time) frames in the same video for uncovered background by checking whether abrupt disparity reduction is more than a given threshold. In particular, in one _{] 4}

implementation, we search each direction (forward and backward) until we find an abrupt disparity reduction satisfying the threshold. The pixel corresponding to the abrupt disparity reduction is a background candidate pixel.

After the abrupt disparity reductions, we may find a smooth disparity change during a period of continuous frames. If so, then the corresponding pixels in these frames can also each be considered a background candidate pixel. A simple method is median-filtering them on the disparity component. Alternatively, for each direction, we can select the pixel whose color is consistent with the existing color determined by background extension as above. The color consistency measure is chosen to be a L₂ distance in the RGB space, where an L₂ distance is the 2^nd root of the sum of squared difference, like sqrt(r^Λ2+g^Λ2+b^Λ2). If different background candidates are found in the forward and backward directions, then the color consistency metric is also used to determine the ultimate background candidate selected. If no background candidate pixels are found, then the existing information obtained from the background extension will be preserved.

In summary, our method to build the boundary layer works in a spatial-temporal way. Information in the temporal dimension is exploited that is more reliable than simply extending the background in the spatial dimension. Note that Algorithms 1 and 2 can replace, for example, the original RGB and depth values using background extension. Various implementations may use temporal pictures, and/or the background layer described below, in generating the boundary layer.

3. Background Layer

Usually holes result from unknown information in the novel view. Multiple reference views provide spatial information in IBR, while temporal consistency offers additional information in video-based rendering.

We propose a temporal method that uses depth information, such as, for example, actual depth values or disparity values. Recall that disparity values are related to depth values. In particular, below we propose a disparity-based temporal method which forms the background for each frame as described herein below. The idea is similar to the temporal background extension in the boundary layer described herein above.

For each pixel location, we investigate its varying disparity in the whole sequence and detect abrupt disparity change that is greater than a given threshold. , ₅

Such disparity suggests, for example, that the foreground moves away and the background appears. This varying disparity as a function of frame, for a given pixel location, can be referred to as the temporal disparity curve.

Based on the preceding, we separate the temporal disparity curve into different segments, where disparity in each segment varies smoothly. Thus, we separate the curve at, for example, the abrupt disparity changes.

For example, a temporal disparity curve may look like a stepped function having a relatively constant value for a first segment of time, and a second (different) relatively constant value for a second segment of time. The jump from the first segment to the second segment indicates, for example, that at the pixel being investigated an object has moved away from the pixel's location and revealed the background for that object. For example, a person may have moved, revealing a parked car behind. Note also that a third segment in time may be associated with the car moving away, revealing a building behind. Each segment is assumed to possess the same background, and that background is also assumed to possibly be the background for another segment. Note that if the disparity is smooth, then we assume that there is no substantial depth change and that the object does not move much. However, the object may be in the background or the foreground. Eventually for each segment we search forward and background to find its background from neighboring segments. Implementations may use an algorithm that utilizes the spatial-temporal (color) consistency to optimize the background model. Such an algorithm can be time-consuming and complicated. Accordingly, other implementations instead use simple median filtering to determine the disparity and the color for each segment.

One technique for determining the pixel value associated with the background for a given pixel is now presented. We examine the disparity curve at the current time and note the disparity value for the segment that includes the current time. Then we move forward and backward in time to find, for example, a neighboring segment that represents the background for that pixel (higher depth). Then we select a time that corresponds to the background segment and access the picture from the selected time (forward or backward in time, as needed). Then we copy the pixel values from the accessed picture to the background layer for the given pixel. Certain implementations form only partial disparity curves by analyzing selected temporal images from a given view. One such implementation analyzes every other temporal image to compute partial disparity curves. Other such implementations actually analyze different temporal images for different pixels. For example, for odd pixels an implementation analyzes odd temporal images (that is, images from the view under consideration for time t-1 , t-3, ... , and t+1 , t+3, ...), and for even pixels analyzes even temporal images.

Various of these methods are not necessarily trying to maintain a background model, although such a model may be maintained in various implementations. Rather, the above implementation temporally discovers the uncovered background information in neighboring frames based on depth change.

4. Compositing in Rendering

The compositing method combines the warped frames from different layers and different views. The emphasis of each reference view is defined by its angular distance as described herein below. The angular distance can be determined as follows.

For a pixel (u, v), we estimate its three dimensional location by Equation (3), i.e., p=(x,y,z)'. We know the optic focal center for reference view i as Ori, i=1 ,2. Meanwhile, we are given the optic focal center for the synthesized view as Os, where Oh and Os can be estimated from the camera parameters, i.e., Oj = -Rj'tj, with Rj denoting the rotation matrix, t denoting the translation vector, and Rj' denoting the transpose of Rj. Then we calculate the angular distance of the three dimensional point p=(x,y,z)' for each reference view by cos(angle(Ori-p-Os))^Λ-q, q>2, i=1 ,2.

The smaller the angular distance, then the smaller the angle, and the closer the reference view is to the synthesized view, resulting in more emphasis of the reference view. Here, in this implementation, we use a "winner-take-all" approach to composite different reference views. For each reference view, we have the background layer, the main layer, and its boundary layer. Pixel blending is realized by Z-buffer, which is known. The pixel blending involves taking the pixel with the smallest depth because that pixel is most likely to be in the foreground. FIG. 9 shows the compositing framework from two reference views. With respect to a reference view 1 , we perform main layer rendering 910, background layer rendering 915, and boundary layer rendering 920. With respect to a reference view 2, we perform main layer rendering 950, background layer rendering 955, and boundary layer rendering 960. Blending 980 is then performed to, for example, obtain a blended image.

Herein below we discuss the procedure using different rendering methods such as, for example, splatting and mesh warping.

The splatting method will render the novel view pixel-by-pixel varying the reconstruction kernel size (which can be considered to be the window function in splatting) depending on the disparity and normal vector orientation of the reference pixel. The splatting kernel size for the background/main layer differs from the boundary layer, because the latter will be warped to the dis-occluded area in the synthesized view. The hole size can be estimated based on the depth discrepancy, which decides the reconstruction kernel size in splatting.

The (triangular) mesh-based method converts each 2x2 section of the depth map into two triangles if the depth difference between either pair of diagonal vertexes is less than the given threshold. When rendering the main layer, the depth discontinuities are handled by removing the corresponding triangles. The background layer is rendered to fill in the holes in the novel view. Eventually, the boundary layer is rendered with those triangles removed by the main layer. In one implementation, to fill in the remaining holes after the three-layered rendering, we run the simplest approach which would examine all the pixels bordering the hole, and copy the one that is the farthest away. The one that is farthest away has the biggest depth and is most likely to be in the background.

Note that a layer may have more than one value for a given pixel. In one implementation, for example, a background layer (915) retains two values for a given pixel. One value represents a foreground value, and a second value represents a background value. The analysis for a given pixel, in a given warped view, may not be able to accurately determine whether or not the given pixel (for example, located at the boundary of a hole) is in the foreground or the background. Accordingly, the analysis may retain the foreground value and the background value that is produced from, for example, the disparity-curve analysis. A second view (950, 955, 960) may provide additional information allowing the blending operation (980) to determine whether the given pixel is in the foreground or the background.

It should be clear that the implementation of FIG. 9 need not produce two final synthesized images (a first from Reference view 1 , and a second from Reference view 2) prior to performing the blending operation (980). Although this is possible in some implementations, the implementation of FIG. 9 performs the blending operation (980) using six "images". The six images are the output from blocks 910, 915, 920, 950, 955, and 960. Note that these "images" need not be full images. For example, in one implementation the background layers (915, 955) need only include the information for the pixels that are part of the hole boundaries.

Other implementations may, for example, combine warped frames from only a single view, but from multiple layers. For example, an implementation may produce only a single warped main layer, a single background layer, and a single boundary layer. These three layers from the same view may be combined to form a composite image.

FIG. 10 is a non-limiting flow diagram of an implementation of a method 1000 for encoding and transmitting image data for a view obtained using depth-image-based rendering.

At step 1005, information from a reference image is accessed. The reference image is for a reference view at a particular time.

At step 1010, information from a second image is accessed. The second image is for a different time than the particular time.

At step 1015, an additional image is created based on the information from the reference image and on the information from the second image. The additional image is for an additional view that is different from the reference view and is for the particular time.

At step 1020, the reference image, the second image, and the additional image are encoded.

At step 1025, the encoded reference image, the encoded second image, and the encoded additional image are transmitted.

Note that many implementations need only perform operations 1005, 1010, and 1015 of the method 1000. That is, these implementations are directed toward creating the additional image. These implementations may be performed at an encoder or at a decoder, for example. On the encoder side, there are several uses. For example, the additional image may be encoded and transmitted, and/or the additional image may be used as a reference for encoding another image. The additional image may be, for example, a synthesis of a view that is to be encoded, and the synthesized additional image may be used as a reference for encoding that view. The encoder may also may also signal to a decoder, using signaling information such as values for a syntax, which information was used to synthesize the additional image. The signaling information may indicate, for example, the view that was used to synthesize the additional image, the view location of the additional image, and any other (for example, temporal information) that was used in the synthesis of the additional image. The decoder can then perform the synthesis of the additional view at the decoder and use that synthesized additional view to decode the encoded view.

On the encoder side, the additional image may be used as a reference for synthesizing yet another image. In such an implementation, the additional image is warped, and a background layer and a boundary layer are generated.

FIG. 11 is a non-limiting flow diagram of an implementation of step 1015 of method 1000 of FIG. 10.

At step 1105, the additional image is synthesized based on the reference image and estimating a value for a pixel in a dis-occluded portion (occurring in, e.g., a background portion) of the additional image using the information from the second image.

At step 1110, a pixel in the reference image that corresponds to the pixel in the dis-occluded portion of the additional image is identified. Such identification may involve, but is not limited to, for example, coherence and consistence of neighboring depth and color information.

At step 1115, depth information for the pixel in the reference image is compared with depth information for a corresponding pixel in the second image.

At step 1120, it is determined whether or not the pixel in the second image is a background/foreground pixel based on the comparing. At step 1125, a size of the dis-occluded portion is refined using depth information, by comparing depth information for a pixel in the dis-occluded portion with depth information for a neighboring pixel outside of the dis-occluded portion, and determining whether to include the neighboring pixel in the dis-occluded portion based on the comparing. ₂

At step 1130, the value of the pixel in the dis-occluded portion is estimated based on a value of the corresponding background/foreground pixel in the second image.

FIG. 12 is a non-limiting flow diagram of an implementation of a method 1200 for demodulating and decoding image data for a view obtained using depth-image-based rendering.

At step 1205, a signal is received and demodulated. The signal includes an encoded reference image and an encoded second image. The reference image is for a reference view at a particular time. The second image is for a different time than the particular time.

At step 1210, the encoded reference image and the encoded second image are decoded.

At step 1215, information from the decoded reference image is accessed.

At step 1220, information from the decoded second image is accessed. At step 1225, an additional image is created based on the information from the decoded reference image and on the information from the decoded second image. The additional image is for an additional view that is different from the reference view and is for the particular time.

At step 1230, at least the additional image is displayed on a presentation device.

Reference in the specification to "one embodiment" or "an embodiment" or "one implementation" or "an implementation" of the present principles, as well as other variations thereof, mean that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment of the present principles. Thus, the appearances of the phrase "in one embodiment" or "in an embodiment" or "in one implementation" or "in an implementation", as well any other variations, appearing in various places throughout the specification are not necessarily all referring to the same embodiment.

It is to be appreciated that the use of any of the following T₁ "and/or", and "at least one of, for example, in the cases of "A/B", "A and/or B" and "at least one of A and B", is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of "A, B, and/or C" and "at least one of A, B, and C", such phrasing is intended to encompass the selection of the first listed option (A) _{2 ]}

only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as readily apparent by one of ordinary skill in this and related arts, for as many items listed.

We thus provide one or more implementations having particular features and aspects. Certain features and aspects relate to using temporal information in the synthesis of additional images, particularly using temporal information in producing a background layer that is used in the synthesis of an additional image. However, other features and aspects have been disclosed. Further, features and aspects of described implementations may also be adapted for other implementations. For example, additional layers may be used, and image-based rendering may be combined with model-based or geometry-based rendering. Although implementations described herein may be described in a particular context, such descriptions should in no way be taken as limiting the features and concepts to such implementations or contexts.

The implementations described herein may be implemented in, for example, a method or a process, an apparatus, or a software program. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus such as, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, such as, for example, computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices that facilitate communication of information between end-users. Implementations of the various processes and features described herein may be embodied in a variety of different equipment or applications, particularly, for example, equipment or applications associated with data encoding and decoding. Examples of equipment include video coders, video decoders, video codecs, web servers, set-top boxes, laptops, personal computers, cell phones, PDAs, and other „

22

communication devices. As should be clear, the equipment may be mobile and even installed in a mobile vehicle.

Additionally, the methods may be implemented by instructions being performed by a processor, and such instructions may be stored on a processor-readable medium such as, for example, an integrated circuit, a software carrier or other storage device such as, for example, a hard disk, a compact diskette, a random access memory ("RAM"), or a read-only memory ("ROM"). The instructions may form an application program tangibly embodied on a processor-readable medium. Instructions may be, for example, in hardware, firmware, software, or a combination. Instructions may be found in, for example, an operating system, a separate application, or a combination of the two. A processor may be characterized, therefore, as, for example, both a device configured to carry out a process and a device that includes a computer readable medium having instructions for carrying out a process.

As will be evident to one of skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry as data the rules for writing or reading the syntax of a described embodiment, or to carry as data the actual syntax-values written by a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known.

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made. For example, elements of different implementations may be combined, supplemented, modified, or removed to produce other implementations. Additionally, one of ordinary skill will understand that other structures and processes may be substituted for those disclosed and the resulting implementations will perform at least substantially the same function(s), in at least substantially the same way(s), to achieve at least substantially the same result(s) as the implementations disclosed. Accordingly, these and other implementations are contemplated by this application and are within the scope of the following claims.

Claims

CLAIMS:

1. A method comprising: accessing (1005) information from a reference image, the reference image being for a reference view at a particular time; accessing (1010) information from a second image, the second image being for a different time than the particular time; and creating (1015) an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

2. The method of claim 1 wherein creating the additional image comprises: synthesizing (1105) the additional image based on the reference image, the additional image including a dis-occluded portion; and estimating (1 130) a value for a pixel in the dis-occluded portion using the information from the second image.

3. The method of claim 2 wherein at least part of the dis-occluded portion represents a background portion of the additional image (1120).

4. The method of claim 2 wherein estimating the value comprises forming one or more of a background layer or a boundary layer.

5. The method of claim 2 further comprising refining (1 125) a size of the dis-occluded portion using depth information.

6. The method of claim 5 wherein refining comprises: comparing (1125) depth information for a pixel in the dis-occluded portion with depth information for a neighboring pixel outside of the dis-occluded portion; and determining (1 125) whether to include the neighboring pixel in the dis-occluded portion based on the comparing. ^

7. The method of claim 2 further comprising: accessing (1010) information from a third image, the third image being for the different time; and estimating (1130) a value for a second pixel in the dis-occluded portion using the information from the third image.

8. The method of claim 2 wherein the second image is from the reference view, and estimating the value of the pixel in the dis-occluded portion comprises: identifying (1110) a pixel in the reference image that corresponds to the pixel in the dis-occluded portion of the additional image; comparing (1115) depth information for the pixel in the reference image with depth information for a corresponding pixel in the second image; determining (1120) that the corresponding pixel in the second image is a background pixel based on the comparing; and estimating (1130) the value of the pixel in the dis-occluded portion based on a value of the corresponding background pixel in the second image.

9. The method of claim 8 wherein at least part of the dis-occluded portion of the additional image represents a background portion of the additional image.

10. The method of claim 8 wherein at least part of the dis-occluded portion of the additional image represents a boundary portion of the additional image.

11. The method of claim 2>wherein the second image is from the reference view, and estimating the value of the pixel in the dis-occluded portion comprises: identifying (1110) a pixel in the reference image that corresponds to the pixel in the dis-occluded portion of the additional image; comparing (1115) depth information for the pixel in the reference image with depth information for a corresponding pixel in the second image; determining (1120) that the corresponding pixel in the second image is a foreground pixel based on the comparing; and estimating (1130) the value of the pixel in the dis-occluded portion based on a value of the corresponding foreground pixel in the second image.

12. The method of claim 1 wherein the method is performed in at least one of an encoder, a decoder, a post-processor subsequent to the reference image being decoded, and a pre-processor prior to the reference image being encoded.

13. An apparatus comprising: means for accessing information from a reference image, the reference image being for a reference view at a particular time; means for accessing information from a second image, the second image being for a different time than the particular time; and means for creating an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

14. A processor-readable medium having stored thereon instructions for causing a processor to perform at least the following: accessing (1005) information from a reference image, the reference image being for a reference view at a particular time; accessing (1010) information from a second image, the second image being for a different time than the particular time; and creating (1015) an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

15. An apparatus comprising a processor configured to perform at least the following: accessing (1005) information from a reference image, the reference image being for a reference view at a particular time; accessing (1010) information from a second image, the second image being for a different time than the particular time; and creating (1015) an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

16. An apparatus comprising: a memory interface (306) for accessing information from a reference image, the reference image being for a reference view at a particular time, and accessing information from a second image, the second image being for a different time than the particular time; and a synthesizer (307) for creating an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

17. An apparatus comprising: a rendering unit (305) configured: to access information from a reference image, the reference image being for a reference view at a particular time, to access information from a second image, the second image being for a different time than the particular time, and to create an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time; an encoder (310) configured to encode the reference image, the second image, and the additional image; and a transmitter (315) configured to transmit the encoded reference image, the encoded second image, and the encoded additional image.

18. An apparatus comprising: means for accessing information from a reference image, the reference image being for a reference view at a particular time; means for accessing information from a second image, the second image being for a different time than the particular time; means for creating an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time; means for encoding the reference image, the second image, and the additional image; and means for transmitting the encoded reference image, the encoded second image, and the encoded additional image.

19. A method comprising: accessing (1005) information from a reference image, the reference image being for a reference view at a particular time; accessing (1010) information from a second image, the second image being for a different time than the particular time; creating (1015) an additional image based on the information from the reference image and on the information from the second image, the additional image being for an additional view that is different from the reference view and being for the particular time; encoding (1020) the reference image, the second image, and the additional image; and transmitting (1025) the encoded reference image, the encoded second image, and the encoded additional image.

20. An apparatus, comprising: a demodulator (405) configured to receive and demodulate a signal, the signal including an encoded reference image and an encoded second image, the reference image being for a reference view at a particular time, and the second image being for a different time than the particular time; a decoder (410) configured to decode the encoded reference image and the encoded second image; and a rendering unit (415) configured: to access information from the decoded reference image; to access information from the decoded second image; and to create an additional image based on the information from the decoded reference image and on the information from the decoded second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

21. The apparatus of claim 20, further comprising a presentation device (420) for displaying the additional image.

22. An apparatus, comprising: means for receiving and demodulating a signal, the signal including an encoded reference image and an encoded second image, the reference image being for a reference view at a particular time, and the second image being for a different time than the particular time; means for decoding the encoded reference image and the encoded second image; means for accessing information from the decoded reference image; means for accessing information from the decoded second image; and means for creating an additional image based on the information from the decoded reference image and on the information from the decoded second image, the additional image being for an additional view that is different from the reference view and being for the particular time.

23. A method, comprising: receiving and demodulating (1205) a signal, the signal including an encoded reference image and an encoded second image, the reference image being for a reference view at a particular time, and the second image being for a different time than the particular time; decoding (1210) the encoded reference image and the encoded second image; accessing (1215) information from the decoded reference image; accessing (1220) information from the decoded second image; and creating (1225) an additional image based on the information from the decoded reference image and on the information from the decoded second image, the additional image being for an additional view that is different from the reference view and being for the particular time.