HK1010297B - Method of optimal disparity estimation for stereoscopic video coding - Google Patents
- Publication number
- HK1010297B (application HK98111264.9A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- window
- enhancement layer
- pixels
- windows
- stereoscopic video
- Prior art date
Description
The invention relates to a method for coding a stereoscopic digital video signal to improve image quality, and more particularly to a method for optimizing an estimate of the disparity between left and right field-of-view pixel intensity values.
Recently, stereoscopic video transmission formats have been proposed, such as the Moving Picture Experts Group (MPEG) MPEG-2 Multi-View Profile (MVP) system, described in document ISO/IEC JTC1/SC29/WG11 N1088, entitled "Proposed Draft Amendment No. 3 to 13818-2 (Multi-view Profile)," November 1995, which is hereby incorporated by reference. Stereoscopic video provides slightly offset views of the same image which are combined to give the scene depth, thus creating a three-dimensional (3D) effect. In such a system, dual cameras spaced about two inches apart record an event as two separate video signals. The spacing between the cameras approximates the distance between a person's left and right eyes. Moreover, with some stereoscopic video cameras, the two lenses are mounted in a single camera head and can therefore be moved in synchronism, for example, when panning across a scene. The two video signals may be transmitted and combined in a receiver to produce an image with scene depth corresponding to that perceived by the human eye.
The MPEG MVP system comprises two video signals transmitted as a multiplexed signal. First, the base layer represents the left field of view of a three-dimensional scene. Second, the enhancement (i.e., auxiliary) layer represents the right field of view of the same scene. Since the left and right views contain the same objects and differ only slightly, there is usually a large degree of correlation between the video images of the base and enhancement layers. This correlation can be used to compress the enhancement layer data relative to the base layer, thereby reducing the amount of data that needs to be transmitted in the enhancement layer to achieve a given picture quality.
The MPEG MVP system includes three types of video pictures; specifically, intra-coded pictures (I-pictures), predictive-coded pictures (P-pictures), and bidirectionally predictive-coded pictures (B-pictures). An I-picture completely describes a single video picture without reference to any other picture. In the base layer, a P-picture is predicted from a previous I-picture or P-picture, while a B-picture is predicted from the closest preceding I- or P-picture and the closest succeeding I- or P-picture. The base layer may be encoded according to the MPEG-2 standard, the details of which are set forth in document ISO/IEC JTC1/SC29/WG11 N0702, entitled "Information Technology - Generic Coding of Moving Pictures and Associated Audio, Recommendation H.262," March 25, 1994, hereby incorporated by reference.
In the enhancement layer, a P-picture can be predicted from the most recently decoded picture in the enhancement layer, regardless of picture type, or from the most recent base layer picture in display order. Moreover, for a B-picture in the enhancement layer, the forward reference picture is the most recently decoded picture in the enhancement layer, and the backward reference picture is the most recent base layer picture in display order. A picture of the enhancement layer can thus be predicted from a picture of the base layer in a cross-layer prediction process known as disparity prediction. Prediction from one frame to another within a layer is known as temporal prediction.
However, with disparity prediction of enhancement layer frames, errors are often introduced due to an imbalance between base and enhancement layer pixel intensity values. This imbalance may be caused by performance variations between the base and enhancement layer cameras, which make disparity estimation and prediction more difficult. The imbalance may also be caused by fading of the scene, or by a significant change in scene brightness and/or contrast, such as a strong flash. The result of such a cross-channel luminance imbalance is that image quality can be significantly degraded.
Schemes have been developed for reducing the effects of cross-channel luminance imbalance. For example, the document ISO/IEC JTC1/SC29/WG11 MPEG 96, by R. Franich et al., Florence, March 1996, entitled "Balance Compensation for Stereoscopic Image Sequences," discusses a linear transformation for adjusting the right-view image sequence to have the same intensity mean and variance as the left-view channel. Puri et al., in document ISO/IEC JTC1/SC29/WG11 MPEG 95/0487, Dallas, November 1995, entitled "Gain Corrected Stereoscopic Coding Using SBASIC for MPEG-4 Multiple Concurrent Streams," discuss the use of a gain and an offset to correct the right field of view. However, these schemes do not minimize the least squares error of the luminance imbalance.
It would therefore be advantageous to provide a disparity estimation method for stereoscopic video systems, such as the MPEG MVP system, that minimizes the effects of cross-channel luminance imbalances due to camera variations and significant changes in scene brightness or contrast. The scheme should be capable of being implemented either globally at the picture level or locally at the macroblock level. In addition, the scheme should be compatible with efficient predictive coding of MPEG-2 video sequences and similar coding protocols. The present invention provides the above and other advantages.
According to one aspect of the present invention, there is provided a method of reducing cross-channel luminance imbalance in an enhancement layer picture of a stereoscopic video signal, wherein a search window is provided comprising at least a portion of the pixels in said enhancement layer picture, and a reference window is provided comprising at least a portion of the pixels in a reference picture of the base layer of said stereoscopic video signal. The method is characterized in that affine transformation coefficients a, b of said reference window are determined which minimize the least squares error between the luminance values of the pixels of said search window and said reference window.
According to another aspect of the present invention, there is provided a method of reducing cross-channel luminance imbalance in an enhancement layer picture of a stereoscopic video signal, wherein a plurality of windows are provided comprising respective portions of the pixels in said enhancement layer picture, and a corresponding plurality of reference windows are provided comprising respective portions of the pixels in a reference picture of the base layer of said stereoscopic video signal. The method is characterized in that affine transformation coefficients a, b are determined which minimize the sum of the least squares errors between the luminance values of the pixels of said enhancement layer picture windows and said corresponding reference windows.
According to a further aspect of the present invention, there is provided a method of decoding a stereoscopic video signal with reduced cross-channel luminance imbalance in the enhancement layer pictures of the signal, wherein affine transform coefficients a, b are retrieved from said stereoscopic video signal, said affine transformation coefficients having been determined by minimizing the least squares error between the luminance values of the pixels of a search window comprising at least a portion of the pixels of said enhancement layer picture and of a reference window comprising at least a portion of the pixels of a reference picture of the base layer of said stereoscopic video signal; and said search window pixel data is recovered using said affine transform coefficients.
Other advantages and features of the present invention will become apparent from the following detailed description of the embodiments with reference to the accompanying drawings.
Fig. 1 is a block diagram of a stereo encoder according to the present invention.
Fig. 2 shows a macroblock level optimization scheme according to the present invention.
Fig. 3 illustrates a process used in a decoder according to the present invention.
Fig. 4 shows a picture level optimization scheme according to the invention.
Fig. 5 shows another embodiment of the picture level optimization scheme according to the invention.
A method is provided for optimizing an estimate of the disparity between the luminance values of right- and left-view pixels in a stereoscopic video signal.
Fig. 1 is a block diagram of a stereoscopic encoder in accordance with the present invention. The encoding scheme may be implemented with the MPEG-2 temporal scalability algorithm. The encoder is shown generally at 100. The left-view sequence, carried in the base layer, is encoded according to a conventional MPEG-2 scheme. The right-view sequence, carried in the enhancement layer, is encoded with the same algorithm as MPEG-2 temporal enhancement layer coding.
A left-view frame buffer 105 of encoder 100 receives base layer pixel data, represented by the vector X, while a right-view frame buffer 130 receives enhancement layer pixel data, represented by the vector Y. The left- and right-view pixels are provided to a disparity estimator 115 for processing, as described in detail below. The disparity estimator 115 provides a disparity vector V = (VX, VY) and disparity estimation parameters a, b to a predictor 120.
Specifically, the disparity estimator 115 performs an affine transformation, where a and b are the affine transformation coefficients. In an affine transformation, one finite point is mapped to another finite point. Here, the coefficient "a" indicates the contrast, and the coefficient "b" indicates the brightness, of the pixel data. The transform coefficients are carried in the stereoscopic video data stream for use in reconstructing the enhancement layer picture in a decoder. The disparity vector V = (VX, VY) is the difference in position between corresponding pixel macroblocks of the base and enhancement layers. Specifically, with pixel coordinates (XS, YS) for a search window macroblock in the enhancement layer, and pixel coordinates (Xr, Yr) for the corresponding reference window macroblock in the base layer, the disparity vector is V = (VX, VY) = (XS - Xr, YS - Yr). Thus, the disparity vector is the positional, or translational, difference between the search window and the reference window. Conventionally, the pixel coordinates of a macroblock are taken at the uppermost, leftmost pixel of the block. The disparity vector may be transmitted in the right-view channel data stream, where it is used in the decoder to reconstruct the disparity-predicted enhancement layer picture. The predictor 120 provides a signal aX + b which is subtracted from the enhancement layer pixel data Y at an adder 140 to provide differential right-view pixel data. The differential right-view pixel data Y - (aX + b) is then provided to a terminal 142.
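As a minimal illustration (not part of the patent text; the function name and example coordinates are hypothetical), the disparity vector definition above can be sketched as follows:

```python
# Hypothetical helper illustrating the disparity vector definition above:
# V = (VX, VY) = (XS - Xr, YS - Yr), where (XS, YS) is the top-left pixel of
# the search-window macroblock (enhancement layer) and (Xr, Yr) that of the
# matched reference-window macroblock (base layer).
def disparity_vector(search_pos, reference_pos):
    (xs, ys), (xr, yr) = search_pos, reference_pos
    return (xs - xr, ys - yr)

# A right-view macroblock at (160, 48) matched to a left-view macroblock at
# (144, 48) gives a purely horizontal disparity, as expected for a stereo pair.
print(disparity_vector((160, 48), (144, 48)))  # (16, 0)
```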
In the base layer, a motion compensation (MC) and discrete cosine transform (DCT) coder 110 receives the left-view pixel data X and encodes it conventionally. The MC/DCT coder 110 then outputs motion vectors and DCT coefficients to an encoder 125. The predictor 120 also receives MC data from the MC/DCT coder 110. A coder 135 receives the right-view pixel data Y and performs motion compensation and/or I-frame encoding, then outputs I-frame pixels to a terminal 143, or outputs motion vectors to the encoder 125. A switch 145 provides either the differential right-view pixel data Y - (aX + b) = Y - aX - b at terminal 142, or the I-frame encoded right-view pixel data at terminal 143, to a DCT coder 150. When terminal 143 is selected, the disparity estimation process is bypassed. This may be desirable, for example, when the least squares error (LSE) is determined to be greater than a given value, or when an I-picture is required by the group of pictures structure. The DCT coder 150 processes the pixel data to provide corresponding transform coefficients to the encoder 125.
At the encoder 125, the left- and right-view motion compensation vectors, DCT coefficients, and disparity vectors are encoded using differential pulse code modulation (DPCM), run-length coding, and Huffman coding to produce left-view channel and right-view channel data streams. The left- and right-view channels are then multiplexed together with the disparity estimation parameters a, b in a multiplexer (not shown) and modulated onto a suitable carrier frequency signal for transmission.
In accordance with the present invention, the disparity estimator 115 minimizes the error in the luminance of the right-view pixels according to a least squares error criterion. Note that the use of the word "error" merely reflects that the left-view data is used as the baseline; the deviation of the right-view data is simply an imbalance, or disparity, relative to the left-view data. Specifically, the disparity estimator 115 minimizes the error E = abs(Y - aX - b)², where "abs" denotes the absolute value. The disparity estimator 115 uses an optimized affine transformation and block matching process, where block matching is performed at the macroblock level. For example, in NTSC format, a video frame may be divided into 30 slices, each having 44 macroblocks, so the entire NTSC frame includes 1,320 macroblocks; in PAL format, there are 1,584 macroblocks. Typically, a macroblock is a 16 × 16 block of pixels, which in the MPEG standard comprises four 8 × 8 pixel blocks.
A search window is defined as the current macroblock in the right-view image, which is to be compared with different macroblocks within a reference window of the left-view image. Specifically, the left-view image used for the comparison is the closest image in display order. The search range (i.e., the size of the reference window) is determined by the motion of the stereo cameras. Typically, horizontal camera motion is greater than vertical motion, so the reference window can be designed with a width greater than its height. For example, the search window may be 16 × 16 pixels, while the reference window may range from 32 × 32 to 64 × 48 pixels. Of course, various search and reference window sizes may be used, and the search window need not correspond to a particular macroblock size.
Fig. 2 shows a macroblock-level optimization scheme in accordance with the present invention. In this embodiment, the least squares error optimization process of the disparity estimator 115 is performed on a single macroblock of the right-view image at a time. The left-view image 200 includes a reference window 210, and the right-view image 220 includes a search window 230. While only one reference window and search window are shown, the entire right-view image 220 may be decomposed into multiple search windows in order to minimize cross-channel luminance imbalance across the whole right-view image. In that case, corresponding additional reference windows are provided in the left-view image, and the reference windows may overlap.
Let Yi, for i = 1 to 256, be the luminance values (i.e., intensities) of the 256 pixels in the 16 × 16 pixel search window 230, and let Xj,i be the intensities of the jth 16 × 16 region of the reference window. Thus, the subscript "j" denotes a specific region of a given reference window, and the subscript "i" denotes a specific pixel within a given window. For example, with a 16 × 16 pixel search window and a 64 × 48 pixel reference window, the search window is compared against (64-16+1) × (48-16+1) = 49 × 33 = 1,617 distinct 16 × 16 regions in the reference window.
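The candidate-region count above can be checked with a one-line calculation (an illustrative sketch; the function name is hypothetical):

```python
# Worked check of the candidate-region count quoted above: a W x H reference
# window contains (W - w + 1) * (H - h + 1) distinct w x h regions, since the
# top-left corner of a w x h block can sit at any of those positions.
def num_candidate_regions(ref_w, ref_h, blk_w=16, blk_h=16):
    return (ref_w - blk_w + 1) * (ref_h - blk_h + 1)

print(num_candidate_regions(64, 48))  # 49 * 33 = 1617
```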
For the jth region of a given reference window, the disparity estimation parameters aj and bj can be determined which minimize the quantity

Ej = Σ (i = 1 to 256) (Yi - aj·Xj,i - bj)².
Performing the function 240 provides the contrast setting a and brightness setting b that give the affine-transformed Xj,i values the smallest squared distance from the Yi values. The minimum Ej occurs where the partial derivatives of Ej with respect to aj and bj are zero, i.e., where

∂Ej/∂aj = 0 and ∂Ej/∂bj = 0.

Solving these conditions, with N = 256 pixels per window, gives

aj = (N·Σi Xj,i·Yi - Σi Xj,i · Σi Yi) / (N·Σi Xj,i² - (Σi Xj,i)²),
bj = (Σi Yi - aj·Σi Xj,i) / N.

The above calculation can be performed by known computational techniques. The "best" affine transform coefficients a* and b* (i.e., the coefficients with the smallest error over all j candidate reference window regions), and the best matching region X*,1, X*,2, ... X*,256 (i.e., X*,i) in the reference window, are determined by the condition

E* = min over j of Ej.
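The closed-form solution above can be sketched in Python (an illustrative sketch, not the patent's implementation; the function names are hypothetical). `fit_affine` solves the two derivative conditions for one candidate region, and `best_match` scans all candidate regions for the smallest Ej:

```python
# Illustrative sketch (not the patent's implementation) of the per-macroblock
# optimization. fit_affine solves dEj/daj = 0, dEj/dbj = 0 in closed form for
# one candidate region; best_match scans every candidate region j of the
# reference window and keeps the one with the smallest error Ej.
def fit_affine(xs, ys):
    """Least-squares fit of y ~ a*x + b over paired luminance samples."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx * sx
    if denom == 0:
        # Flat reference block: fall back to the identity transform (an
        # assumption here, mirroring the a = 1, b = 0 fallback in the text).
        return 1.0, 0.0
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b

def best_match(candidate_regions, search_window):
    """Return (j*, a*, b*) minimizing Ej over all candidate regions j."""
    best = None
    for j, xs in enumerate(candidate_regions):
        a, b = fit_affine(xs, search_window)
        err = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, search_window))
        if best is None or err < best[0]:
            best = (err, j, a, b)
    return best[1], best[2], best[3]
```

For instance, a search window whose luminance is an affine function of one candidate region (here y = 2x + 5) is matched to that region with a ≈ 2 and b ≈ 5.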
Note that a > 0 is required, but a need not equal 1. Then, for a 16 × 16 block of 256 pixels, after the affine transformation, if a pixel X′ = aX + b > 255, X′ is set to 255, and if X′ = aX + b < 0, X′ is set to 0. If either of these conditions occurs, the least squares error is re-checked to ensure that abs(Y - X′)² < abs(Y - X)²; otherwise, if abs(Y - X′)² ≥ abs(Y - X)², a is set to 1 and b to 0.
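This clipping safeguard can be sketched as follows (an assumed implementation for 8-bit luminance; the function name is hypothetical):

```python
# Sketch of the clipping safeguard described above, assuming 8-bit luminance.
# Transformed pixels are clamped to [0, 255]; if clamping occurred and the
# clamped prediction is no better than the raw reference, the transform
# reverts to a = 1, b = 0 as described in the text.
def apply_affine_with_fallback(ref, target, a, b):
    pred = [min(255.0, max(0.0, a * x + b)) for x in ref]
    clipped = any(not (0 <= a * x + b <= 255) for x in ref)
    if clipped:
        err_pred = sum((y - p) ** 2 for y, p in zip(target, pred))
        err_ref = sum((y - x) ** 2 for y, x in zip(target, ref))
        if err_pred >= err_ref:
            return 1.0, 0.0, list(ref)  # revert to identity transform
    return a, b, pred
```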
Once a* and b* for a given search window are found, the optimization criterion is satisfied, and the corresponding disparity vector may also be determined, as discussed above. This process is repeated for each search window in the enhancement layer picture. For example, for an NTSC format with 1,320 macroblocks per picture, Ej is minimized for each of the 1,320 macroblocks. Thus, for each macroblock in the enhancement layer picture, a* and b*, and (VX, VY), are stored and transmitted in the data stream, where they are used in the decoder to reconstruct the right-view picture. As can be seen, with the foregoing minimization procedure, an optimized disparity vector (VX, VY) is obtained for each search window macroblock in the right-view image. In addition, optimized contrast and brightness settings a* and b* are found individually for each macroblock.
A disadvantage of this method is its implementation complexity. First, the search algorithm is more complex than a conventional block matching algorithm, due to the additional computations that must be performed. Second, the coefficients a and b must be carried in the data stream for each search window macroblock, which increases the data rate. Finally, the method requires user data in the MPEG-2 picture-level syntax, or some user-defined syntax.
To reduce the per-macroblock computational complexity and the data rate overhead, the granularity of the optimization parameters a and b can be increased. For example, a and b may be determined for each slice of a frame or field, or for groups of macroblocks of various sizes. In this way, the overall number of coefficients carried in the data stream for each enhancement layer picture is reduced. Moreover, in a feedback method, a and b may be recalculated until a given criterion, such as a target error, is met.
Fig. 3 shows a process used in a decoder in accordance with the present invention. At block 305, encoded differential right-view luminance pixel data received via the stereoscopic video signal is stored in memory. The encoded data is variable-length decoded (e.g., inverse Huffman coded) using conventional means (not shown). The transform coefficients and pixel data are provided to block 315 for inverse quantization. The inverse quantization function 315 uses a quantization parameter provided by block 335, which may come from a look-up table, for example. The inverse-quantized differential right-view pixel data is stored in memory at block 320 and processed with an inverse DCT function at block 325 to provide uncorrected differential right-view pixel data.
Meanwhile, decoded reference data X for the left-view image, recovered from the stereoscopic video signal at block 340, is stored in memory at block 345 for use in prediction. Block 345 also receives the disparity vectors provided by block 360. The reference data X is then affine transformed at block 350 according to the affine transform coefficients a, b received by function block 365. The predicted left-view reference data is stored in memory at block 355 and then summed with the uncorrected differential right-view pixel data to provide luminance-corrected right-view pixel data in accordance with the present invention, which is output to a data buffer at block 330 for subsequent processing and display.
Fig. 4 shows a picture-level optimization scheme in accordance with the present invention, in which the least squares error technique described above is used in block matching. As before, the left-view image 200 includes the reference window 210, and the right-view image 220 includes the search window 230. While only one search window and reference window are shown, it should be understood that the following process can be applied to multiple search and reference windows to minimize cross-channel luminance imbalance over the full right-view image.
A conventional block matching algorithm is first performed at block 400 to determine a disparity vector (VX, VY) for each of the n macroblocks in the right-view image. For example, for an NTSC format image, n = 1,320. In a conventional block matching algorithm, the block of pixels to be matched is compared with other blocks of pixels to determine which is most similar in image content.
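A conventional block matching step of this kind might be sketched as follows (a sum-of-absolute-differences matcher is assumed here; the text does not prescribe a particular similarity measure, and the function names are hypothetical). The returned vector follows the earlier definition V = (XS - Xr, YS - Yr):

```python
# Assumed sketch of a conventional block matcher using the sum of absolute
# differences (SAD). Images are lists of pixel rows; the returned vector
# follows the earlier definition V = (XS - Xr, YS - Yr).
def sad(block_a, block_b):
    return sum(abs(p - q)
               for row_a, row_b in zip(block_a, block_b)
               for p, q in zip(row_a, row_b))

def block_match(right_block, left_image, top, left, rng, size):
    """Find the disparity vector (VX, VY) for a size x size block located at
    (top, left) in the right view, searching +/- rng pixels in the left view."""
    best = None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y0, x0 = top + dy, left + dx
            if y0 < 0 or x0 < 0:
                continue
            if y0 + size > len(left_image) or x0 + size > len(left_image[0]):
                continue
            cand = [row[x0:x0 + size] for row in left_image[y0:y0 + size]]
            cost = sad(right_block, cand)
            if best is None or cost < best[0]:
                best = (cost, x0, y0)
    return (left - best[1], top - best[2])
```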
Then, at block 410, the sum of the least squares errors over all search windows is minimized to find the best overall contrast and brightness (i.e., coefficients a, b) between the left- and right-view images. Thus, for a given right-view image, the disparity vectors and the disparity-compensated blocks for all search window macroblocks can be determined using the conventional block matching algorithm.
Let y1, y2, ... yn (i.e., yi, for i = 1 to n) be the data for the n right-view macroblocks, and let x1, x2, ... xn be the corresponding disparity-compensated data from the left-view image 200. Coefficients a* and b* may then be determined which minimize

E = Σ (i = 1 to n) (yi - a·xi - b)².

Thus, instead of providing a pair of coefficients for each search window macroblock, a single pair of coefficients a* and b* is provided for the entire image.
This error minimization technique provides the overall contrast and brightness settings that minimize the least squares error of the affine-transformed left-view image relative to the right-view image; the minimum occurs where

∂E/∂a = 0 and ∂E/∂b = 0,

yielding, as before,

a = (n·Σi xi·yi - Σi xi · Σi yi) / (n·Σi xi² - (Σi xi)²),
b = (Σi yi - a·Σi xi) / n.
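The picture-level fit can be sketched the same way (an assumed implementation, with a hypothetical function name; here the least-squares sums are accumulated over pooled samples so that only one coefficient pair results per picture):

```python
# Picture-level sketch (assumed implementation): a single (a, b) pair is fit
# by accumulating the least-squares sums over all disparity-compensated
# (xi, yi) sample pairs pooled from the whole picture, rather than solving
# once per macroblock as in the Fig. 2 scheme.
def picture_level_affine(pairs):
    """pairs: iterable of (x, y) luminance samples pooled over the picture."""
    n = sx = sy = sxx = sxy = 0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    denom = n * sxx - sx * sx
    if denom == 0:
        return 1.0, 0.0  # degenerate picture: keep the identity transform
    a = (n * sxy - sx * sy) / denom
    return a, (sy - a * sx) / n
```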
Compared with the embodiment discussed in connection with Fig. 2, in which each individual macroblock is optimized, the coding complexity and overhead of this embodiment of the present invention are greatly reduced. In particular, the overhead is reduced because the parameters a and b need be stored and transmitted only once, in the picture-level user data. However, with the present encoding process, a buffer is required to store the encoded information for a frame (or field) until the coefficients a and b are determined, since the user data, including a and b, is transmitted in the data stream ahead of the encoded image data itself.
It is also noted that the present technique may provide picture-level optimization for other block sizes, such as a slice or a portion of a picture.
Decoding for the optimization scheme of Fig. 4 may be carried out using the decoding process of Fig. 3, where X is the predicted left-view reference data.
Fig. 5 shows another embodiment of the picture-level optimization scheme in accordance with the present invention. The right-view image 220 includes a window 235 corresponding in location to a reference region 215 in the left-view image 200. The window 235 is not termed a "search" window because there is no search process; instead, window 235 is a direct translation of the reference region 215 of the left-view image 200 into the right-view image 220. In this embodiment, the LSE optimization parameters a* and b* are found directly from the left-view image X and the right-view image Y by minimizing the sum of the least squares errors over all left-view windows at block 500. Next, at block 510, a disparity vector (VX, VY) is determined for each macroblock in Y using block matching between the affine-transformed left-view image aX + b (shown at 505) and the right-view image Y. An advantage of this embodiment of the invention is that no buffer is required to store image data prior to transmission.
After the disparity vectors (VX, VY) and the optimized parameters a and b are obtained, disparity estimation is performed in the same manner as motion estimation. However, the reference frame now comes from the encoded left-view sequence rather than from the right-view sequence itself, and the best disparity-compensated block is obtained from an affine transformation of the corresponding reference block.
The decoding of the optimization scheme of fig. 5 may be carried out using the decoding method of fig. 3.
It can thus be seen that the present invention provides a method and apparatus for optimizing disparity estimation in a stereoscopic video encoder. In one embodiment, the least squares error is minimized for each macroblock of the right-view image individually. Alternatively, the optimization may be performed after the blocks of the right-view image have been matched with the left-view image. Or, block matching may be performed between the affine-transformed left-view image and the right-view image after a picture-level least squares error optimization.
Other variations of the present invention are possible. For example, one technique may be used to optimize one portion of an image while another technique is used to optimize another portion. Alternatively, the choice of technique may be based on image criteria, such as picture type, sequence structure or display order of the pictures in the transmission, image complexity, image quality, bandwidth requirements, and quantization level.
In other variations, the LSE optimization may be implemented in a closed-loop system to achieve a constant error level or a target error range. For example, in a first iteration, a smaller search window may be used; if the resulting error is less than the expected level, the optimization can be repeated with larger block sizes. In this manner, the number of estimation coefficients that must be transmitted for each image can be reduced while still maintaining an acceptable cross-channel luminance imbalance. While the invention has been described in conjunction with various specific embodiments, it will be understood that variations and modifications may be effected by those of ordinary skill in the art without departing from the spirit and scope of the invention. The scope of the invention is determined by the claims.
Claims (11)
1. A method of reducing cross-channel luminance imbalance in an enhancement layer picture of a stereoscopic video signal, wherein a search window is provided comprising at least a portion of pixels in said enhancement layer picture; and providing a reference window comprising at least a portion of the pixels in the reference image of the base layer of said stereoscopic video signal;
the method is characterized in that:
affine transformation coefficients a, b of said reference window are determined which minimize the least square error between the luminance values of the pixels of said search window and said reference window.
2. The method according to claim 1, further comprising the step of:
performing affine transformation on the pixel data of the reference window by using the affine transformation coefficients a and b;
differentially encoding said search window pixel data using said transformed reference window pixel data; and transmitting said differentially encoded search window pixel data with said stereoscopic video signal for reconstructing said enhancement layer image.
3. A method according to claim 1 or 2, further comprising the step of:
resizing at least one of said search window and said reference window and repeating said minimizing step until said least squares error is within a target error range.
4. A method according to claim 1 or 2, further comprising the step of:
transmitting said affine transform coefficients a, b in said stereoscopic video signal used to reconstruct said enhancement layer picture.
5. A method according to claim 1 or 2, further comprising the step of:
providing an additional search window comprising portions of pixels in said enhancement layer picture;
providing additional corresponding reference windows comprising respective portions of pixels in said reference image; and
for each of said additional search windows, a set of affine transformation coefficients a, b is determined which minimizes the least square error between the luminance values of the pixels of said search window and the corresponding reference window.
6. A method of reducing cross-channel luminance imbalance in an enhancement layer picture of a stereoscopic video signal, wherein a plurality of windows are provided comprising respective portions of pixels in said enhancement layer picture; and providing a corresponding plurality of reference windows comprising respective portions of pixels in a reference image of a base layer of said stereoscopic video signal;
the method is characterized in that:
affine transformation coefficients a, b are determined which minimize the sum of the least square errors between the luminance values of the pixels of said enhancement layer picture window and said corresponding reference window.
7. The method of claim 6, wherein said plurality of enhancement layer picture windows are search windows, further comprising the steps of:
prior to said determining step, matching said plurality of search windows to respective regions of said corresponding plurality of reference windows.
8. The method according to claim 6 or 7, further comprising the step of:
transforming said respective plurality of reference windows in accordance with said affine transform coefficients a, b to provide a plurality of transformed reference windows;
matching regions of corresponding ones of said plurality of enhancement layer picture windows and said transformed plurality of reference windows to provide matched plurality of enhancement layer picture windows; and
for each of said matched plurality of enhancement layer picture windows, a disparity vector is determined which indicates a translation difference between the matched enhancement layer picture window and the corresponding transformed reference window.
9. The method of claim 8, further comprising the step of:
transmitting said disparity vector in said stereoscopic video signal used to reconstruct said enhancement layer picture.
10. A method of decoding a stereoscopic video signal with reduced cross-channel luminance imbalance in enhancement layer pictures of the signal, characterized by:
retrieving affine transform coefficients a, b from said stereoscopic video signal;
determining said affine transformation coefficients by minimizing the least square error between the luminance values of the pixels of a search window comprising at least a portion of the pixels of said enhancement layer picture and of the reference window comprising at least a portion of the pixels of the reference picture of the base layer of said stereoscopic video signal; and
recovering said search window pixel data using said affine transform coefficients.
11. The method of claim 10, wherein said search window pixel data is carried in said stereoscopic video signal as differentially encoded data, further comprising the steps of:
retrieving said reference window pixel data from said stereoscopic video signal;
providing reference window pixel prediction data using said reference window pixel data;
affine transforming said reference window pixel prediction data in accordance with said affine transform coefficients; and
summing said differentially encoded data with said affine transformed reference window pixel prediction data to recover said search window pixel data.
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US08/692,630 US5652616A (en) | 1996-08-06 | 1996-08-06 | Optimal disparity estimation for stereoscopic video coding |
| US08/692,630 | 1996-08-06 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1010297A1 (en) | 1999-06-17 |
| HK1010297B (en) | 2002-10-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| EP0823826B1 (en) | Optimal disparity estimation for stereoscopic video coding | |
| EP0915433B1 (en) | View offset estimation for stereoscopic video coding | |
| KR100481732B1 (en) | Apparatus for encoding of multi view moving picture | |
| US8644386B2 (en) | Method of estimating disparity vector, and method and apparatus for encoding and decoding multi-view moving picture using the disparity vector estimation method | |
| CN100512431C (en) | Method and apparatus for encoding and decoding stereoscopic video | |
| US8179969B2 (en) | Method and apparatus for encoding or decoding frames of different views in multiview video using global disparity | |
| US20070104276A1 (en) | Method and apparatus for encoding multiview video | |
| JP2007180981A (en) | Device, method, and program for encoding image | |
| WO2006080739A1 (en) | Method and apparatus for encoding and decoding multi-view video using image stitching | |
| Yang et al. | An MPEG-4-compatible stereoscopic/multiview video coding scheme | |
| WO2007035054A1 (en) | Method of estimating disparity vector, and method and apparatus for encoding and decoding multi-view moving picture using the disparity vector estimation method | |
| US20110096151A1 (en) | Method and system for noise reduction for 3d video content | |
| CN101243692B (en) | Method and device for encoding multi-view video | |
| US20110129012A1 (en) | Video Data Compression | |
| US20070047642A1 (en) | Video data compression | |
| JP2007180982A (en) | Device, method, and program for decoding image | |
| Sethuraman et al. | Segmentation-based coding of stereoscopic image sequences | |
| HK1010297B (en) | Method of optimal disparity estimation for stereoscopic video coding | |
| WO2006110007A1 (en) | Method for coding in multiview video coding/decoding system | |
| KR100239867B1 (en) | A method of compressing stereoscopic video for suppression of image deterioration when motion prediction and time difference prediction are applied | |
| Belbel et al. | Improved inter-view correlations for low complexity MV-HEVC | |
| HK1008288A (en) | Optimal disparity estimation for stereoscopic video coding | |
| MXPA97005999A (en) | Optimal disparity estimation for stereoscopic video coding | |
| Brites et al. | Correlation noise modeling for multiview transform domain Wyner-Ziv video coding | |
| Pourazad et al. | An efficient low random-access delay panorama-based multiview video coding scheme |