HK1114724A - Method and apparatus for encoder assisted-frame rate up conversion (ea-fruc) for video compression - Google Patents
Description
Claim of Priority under 35 U.S.C. § 119
This patent application claims priority from provisional application No. 60/589,901, entitled "Encoder Assisted Frame Rate Up Conversion", filed July 20, 2004, which is assigned to the assignee of this application and is hereby expressly incorporated herein by reference.
Technical Field
The embodiments described herein relate generally to digital video compression and, more particularly, relate to a method and apparatus for encoder assisted Frame Rate Up Conversion (EA-FRUC) for video compression.
Background
Video formats supporting various frame rates exist today. The following formats are currently the most common, listed by their supported frames per second (fps): 24 (film), 25 (PAL), 30 (typically interlaced video), and 60 (high definition (HD), e.g., 720p). Although these frame rates are suitable for most applications, in order to meet the low bandwidth requirements of mobile handset video communications, frame rates are sometimes reduced to as low as 15, 10, 7.5 or 3 fps. While these low frame rates allow low end devices with lower computing power to display some video, the resulting image quality suffers from "jerkiness" (i.e., the effect of a slide show) rather than smooth motion. Also, the dropped frames often do not properly track the amount of motion in the video. For example, fewer frames should be dropped during portions of "high motion" video content, such as those occurring in sporting events, while more frames can be dropped during segments of "low motion" video content, such as those occurring in talk shows. Video compression is content dependent, and it is desirable to be able to analyze and incorporate the motion and texture features of the sequence to be encoded in order to improve video compression efficiency.
Frame Rate Up Conversion (FRUC) is a process that uses video interpolation at the video decoder to increase the frame rate of the reconstructed video. In FRUC, an interpolated frame is generated using received frames as references. Currently, systems implementing FRUC frame interpolation use various methods based on motion compensated interpolation and the processing of transmitted motion vectors. FRUC is also used in conversions between various video formats. For example, in telecine and inverse telecine applications, which are film-to-videotape conversion techniques that correct for the frame rate difference between film and video, progressive video (24 frames/sec) is converted to NTSC interlaced video (29.97 frames/sec).
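As a concrete illustration of the film-to-video conversion just mentioned, the classic telecine cadence repeats the fields of alternate film frames in a 3:2 pattern. The sketch below is a simplified model and an assumption for illustration only: it ignores top/bottom field interleaving and the 1000/1001 slowdown that yields 29.97 Hz.

```python
def telecine_3_2_pulldown(frames):
    """Map 24 fps progressive frames to 30 fps interlaced video by
    repeating fields in a 3:2 cadence. Simplified sketch: each frame is
    a label and each output entry is a (frame, field_index) pair; real
    telecine interleaves top/bottom fields and runs at 29.97 Hz."""
    fields = []
    for i, frame in enumerate(frames):
        repeats = 3 if i % 2 == 0 else 2  # alternate 3 fields, 2 fields
        fields.extend((frame, k) for k in range(repeats))
    return fields

# Four film frames become ten fields, i.e. five interlaced video frames,
# which is exactly the 24 -> 30 fps ratio (4:5).
fields = telecine_3_2_pulldown(["A", "B", "C", "D"])
```

Inverse telecine reverses the process by detecting and removing the repeated fields to recover the original 24 fps progressive sequence.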
Another FRUC method uses Weighted Adaptive Motion Compensated Interpolation (WAMCI) to reduce block distortion due to motion compensation and block-based processing imperfections. The method is based on an interpolation process by weighted summation of a plurality of Motion Compensated Interpolation (MCI) images. In the proposed method, block distortion at block boundaries is reduced by using a similar technique as Overlapped Block Motion Compensation (OBMC). Specifically, to reduce blurring during processing of overlapping regions, the method uses motion analysis to determine the type of block motion and adaptively uses OBMC. Experimental results show that the proposed method achieves improved results with significantly reduced block distortion.
Yet another FRUC method uses motion vector reliability analysis to reduce distortion caused by motion vectors erroneously transmitted from the encoder. In this method, motion vectors are constructed using motion estimation, and these constructed motion vectors are compared with the transmitted motion vectors in order to determine the most desirable method for frame interpolation. In conventional up-conversion algorithms that use motion estimation, the estimation process is performed using two adjacent decoded frames in order to construct the motion vectors that allow interpolation of a frame. However, these algorithms make no attempt to improve the use of transmission bandwidth by accounting for the amount of computation required for the motion estimation operation. In contrast, in up-conversion algorithms that use transmitted motion vectors, the quality of the interpolated frame depends strongly on the motion vectors derived by the encoder. In a combination of these two approaches, the transmitted motion vectors are first analyzed to decide whether they can be used to construct an interpolated frame. The interpolation method is then adaptively selected from among the following three methods: local motion compensated interpolation, global motion compensated interpolation, and frame repeat interpolation.
While FRUC techniques are typically implemented as a post-processing function in the video decoder, the video encoder is typically not involved in this operation. However, in a method called encoder assisted FRUC (EA-FRUC), the encoder can determine whether transmission of specific information related to motion vectors or reference frames (e.g., residual data) can be omitted, while still allowing the decoder to regenerate major portions of the frames without the omitted vector and residual data. For example, a bi-directional predictive video coding method has been introduced to improve B-frame coding in MPEG-2. In that method, it is proposed to use an error criterion so that true motion vectors can be applied in motion compensated predictive coding. The distortion measure is based on the sum of absolute differences (SAD), but SAD is known to be insufficient as a true distortion measure, especially when the amount of motion between two frames in a sequence must be determined. In addition, fixed thresholds are used for classification when these thresholds should be variable, since the classification preferably depends on the content.
EA-FRUC is a growing area of research. There is increasing interest in it within video compression, especially in low bit rate applications such as streaming video and video telephony, and particularly where the transmitting end is a network node that can support high complexity applications and the receiving end is a handheld device with power and complexity limitations. EA-FRUC can also find application in open systems, where the decoder conforms to a standard or popular video coding technique, as well as in closed systems, where proprietary decoding techniques are employed.
What is desired is a method that provides high quality interpolated frames at the decoder while reducing both the bandwidth required to transmit the information needed for interpolation and the computation required to generate those frames, making it well suited to multimedia mobile devices that rely on low power processing.
Therefore, there is a need to overcome the above-mentioned problems.
Disclosure of Invention
These embodiments provide an encoder assisted frame rate up conversion (EA-FRUC) system that uses video encoding and pre-processing operations at the video encoder to take advantage of the FRUC processing that will occur in the decoder, in order to improve compression efficiency and reconstructed video quality.
In one embodiment, the processing includes: determining whether to encode a frame of a sequence of frames of video content by determining spatial activity within the frame, determining temporal activity in the frame, determining redundancy of at least one of the determined spatial activity, the determined temporal activity, and the determined spatio-temporal activity; and encoding the frame if the determined redundancy is below a predetermined threshold.
In another embodiment, the processing includes: determining whether to encode a set of frames of a sequence of frames of video content by determining spatial activity in the set of frames comprising one or more frames, determining temporal activity in the set of frames, determining redundancy of at least one of the determined spatial activity, the determined temporal activity, and the determined spatio-temporal activity; and encoding one or more frames of the set of frames if the determined redundancy falls within a set of predetermined thresholds.
In another embodiment, a computer-readable medium having stored thereon instructions for causing a computer to perform a method for constructing a video sequence comprising a sequence of frames is disclosed. The method comprises the following steps: determining spatial activity in a frame of a sequence of frames; determining a temporal activity in the frame; determining a redundancy of at least one of the determined spatial activity and the determined temporal activity; and encoding the frame if the determined redundancy is below a predetermined threshold.
In another embodiment, an apparatus for constructing a video sequence comprising a sequence of frames is also disclosed. The device includes: means for determining spatial activity in a frame of a sequence of frames; means for determining temporal activity in the frame; means for determining a redundancy of at least one of the determined spatial activity and the determined temporal activity; and means for encoding the frame if the determined redundancy is below a predetermined threshold.
In another embodiment, at least one processor configured to implement a method for constructing a video sequence comprising a sequence of frames is disclosed. The method comprises the following steps: determining spatial activity in a frame of a sequence of frames; determining a temporal activity in the frame; determining a redundancy in at least one of the determined spatial activity and the determined temporal activity; and encoding the frame if the determined redundancy is below a predetermined threshold.
Other objects, features and advantages will become apparent to those skilled in the art from the following detailed description. It is to be understood, however, that the detailed description and specific examples, while describing exemplary embodiments, are given by way of illustration and not limitation. Many changes and modifications may be made within the scope of the following description without departing from the spirit thereof, and the description should be understood to include all such modifications.
Drawings
The invention may be more readily understood by reference to the accompanying drawings in which:
FIG. 1 is a block diagram of a video encoding system implementing an encoder assisted frame rate up conversion (EA-FRUC) system, consistent with one embodiment;
FIG. 2 is a flow chart illustrating the operation of the EA-FRUC system of FIG. 1;
FIG. 3 is a diagram illustrating one-pass encoding consistent with one embodiment of the EA-FRUC system of FIG. 1;
FIG. 4 is a diagram illustrating two-pass encoding consistent with one embodiment of the EA-FRUC system of FIG. 1; and
FIG. 5 is a block diagram illustrating the application of the EA-FRUC system 100 to a wireless system.
Like numbers refer to like parts throughout the several views.
Detailed Description
Frame Rate Up Conversion (FRUC) is a technique for increasing the frame rate at the decoder in low bit rate video transmission. Typically, it is a decoder operation. However, by anticipating the decoder's FRUC processing, the video encoder can intelligently determine which frame or frames in the video sequence to drop (i.e., not send to the decoder), increasing the overall compression ratio and thereby improving compression efficiency. As described herein, in one embodiment of an encoder assisted FRUC (EA-FRUC) system, the encoder has access to the original frame and to a priori knowledge of the FRUC algorithm used at the decoder, and uses the frame interpolated with this a priori knowledge to send additional information that assists the decoder in FRUC and improves the decisions made during interpolation. Knowing that FRUC will be performed in the decoder, the EA-FRUC system uses video coding and pre-processing operations at the encoder to improve compression efficiency (and thus transmission bandwidth utilization) and reconstructed video quality. In particular, information that may supplement or replace the information normally transmitted by the encoder is provided to the decoder for use in conventional or encoder assisted FRUC.
In one embodiment, the information provided by the encoder includes parameters such as: spatial (e.g., refinement, mode decision, neighborhood characteristics) and temporal (e.g., motion vector decision) characteristics of the image to be interpolated at the decoder, and difference information between normal predicted (B or P) frame coding and the interpolated frame resulting from FRUC processing. The frames interpolated by the FRUC process are referred to herein as "F frames".
Overview of encoder assisted FRUC
Fig. 1 illustrates a video encoding/decoding ("encoding") system 100 configured in accordance with one embodiment. The encoding system 100 includes a video encoder 104 that processes digital video data so as to optimize that data for transmission to, and decoding by, one or more decoders. Specifically, in one embodiment, the video encoder 104 uses a video encoding algorithm to encode and compress the input original video 102, thereby reducing the bandwidth required to transmit the video 102 to the decoder 154. The compression efficiency of the video encoder 104 may be improved by various methods, one of which is reducing the frame rate of the transmitted frames (i.e., reducing the number of frames to be transmitted). A FRUC mechanism is then used at the decoder 154 to increase the frame rate of the decoded video stream and improve motion rendition. In particular, the decoder 154 uses reference frames in the encoded video stream received from the encoder 104 to generate the interpolated frames. As further described herein, the video encoder 104 "knows" during the encoding operation that the video decoder 154 can perform FRUC, and takes advantage of this capability to reduce the number and size of transmitted frames.
F frame analysis
In one embodiment, the encoder 104 of the encoding system 100 includes a content classification module 106 that determines the spatial and temporal complexity both (i) within each frame of the video sequence and (ii) between multiple frames of the video sequence. The encoder 104 uses the results of this operation to determine (i) which frames of a sequence of frames (also referred to as a group of pictures (GOP)) may be dropped, and (ii) how many consecutive frames can be dropped between two encoded frames. By definition, each GOP is composed of an arrangement of one I picture (frame), a P picture (frame), and one or more B pictures (frames). The GOP can serve as a basic access unit, with the I frame serving as an access point to facilitate random access. It should be noted that a GOP may be composed of a variable number of frames. It is assumed that any dropped frames can be properly reconstructed as needed using known FRUC techniques in the decoder 154. In one embodiment, the analysis may be performed using one of the following methods:
1. The importance of each frame in the sequence is ranked according to the activity in the sequence (e.g., slow motion versus high speed motion, flat regions versus complex texture), and then all highly correlated frames in the sequence are discarded.
2. The FRUC algorithm available at decoder 154 is used at encoder 104 to construct the expected FRUC frame. The original frame is discarded if the correlation between the original frame and its interpolated version is high. Alternatively, if the original frame is encoded as a B frame and the reconstructed B frame is highly correlated with its interpolated version, the highly correlated B frame is discarded. If not, the original frame is encoded and transmitted. If the correlation between the original frame and its interpolated frame or between the B-frame and its interpolated frame is moderate, the non-redundant part of the frame is encoded as auxiliary information for improving the quality of the interpolated frame with respect to its corresponding original frame.
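The second method above can be sketched as follows. The FRUC stand-in here is a simple pixel-wise average of the two neighboring frames, and the correlation thresholds are illustrative assumptions, not values taken from the text:

```python
def correlation(a, b):
    """Normalized cross-correlation of two equal-length pixel lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) *
           sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 1.0

def fruc_interpolate(prev, nxt):
    """Stand-in for the decoder's FRUC: pixel-wise average of neighbors."""
    return [(p + q) / 2 for p, q in zip(prev, nxt)]

def classify_frame(orig, prev, nxt, hi=0.95, lo=0.7):
    """Decide whether to drop, assist, or encode a frame, per the scheme
    above; hi/lo thresholds are assumed for illustration."""
    r = correlation(orig, fruc_interpolate(prev, nxt))
    if r >= hi:
        return "drop"    # decoder FRUC alone reconstructs it well
    if r >= lo:
        return "assist"  # encode the non-redundant part as side info
    return "encode"      # transmit the frame normally
```

In a real encoder the same decision would be made against the decoder's actual FRUC algorithm, known a priori, rather than this averaging stand-in.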
Fig. 2 illustrates one embodiment of a process used by the encoding system 100 to classify the original content. In one embodiment, it is determined in step 208 whether there are any delay constraints on the encoding of the original content. For example, a real-time streaming application or encoder (e.g., one carrying conversational video such as video telephony) is constrained by delay requirements and typically must complete all encoding operations in a single pass; in this case, one-pass encoding is performed. In contrast, as shown in step 216, for non-conversational video, such as Video On Demand (VOD), digital camera, and camcorder applications, in which the encoded video is stored and the encoding operation is therefore unconstrained in terms of time resources, two-pass encoding may be used. Because of these differences, the quality and degree of the content classification performed by the encoding system 100 differs between the two modes, as described herein.
Spatial activity
With continued reference to fig. 2, and referring back to fig. 1, spatial activity is determined by the content classification module 106 in step 210. In particular, the content classification module 106 determines the amount of spatial activity in the video source 102. In one embodiment, spatial activity refers to the amount of texture information, such as edges, saturated colors, and high contrast objects, in the image frames of a video sequence. In general, the greater the amount of texture information in the video sequence, the greater the spatial activity. In one embodiment, the following metrics may be used to quantify the texture information:
a. Mean: In block-based encoding, the mean of each block is compared to (i) the mean of the frame or (ii) the mean of a neighborhood of blocks of various sizes.
b. Variance: the amount of pixel variance in each macroblock can be compared to a predetermined data-dependent threshold to determine spatial activity. Alternatively, the blocks may be classified based on variance and mean measures, in which case different thresholds may be used for different mean ranges.
c. Variable block size/shape mean and variance: the mean and variance measures can be extended to variable block sizes and to objects occupying arbitrary size (and shape) regions in an image or frame.
d. Contrast ratio: (i) the ratio between the standard deviation of a block, region or object and (ii) the mean of a region or neighborhood of blocks (e.g., 3 x 3 blocks) may be used to provide a contrast metric in the neighborhood of elements. In addition, the contrast ratio may be weighted based on the average value. Specifically, the contrast ratio of a given block or macroblock is expressed as the sum of the differences between the mean of the current block normalized by the mean of all blocks in the neighborhood and the mean of the respective neighboring blocks (8 neighboring blocks in a 3 × 3 neighborhood of 9 blocks). This metric provides effective granularity into spatial texture information that translates into spatial activity and is successfully used in block segmentation algorithms for variable block size DCT (also known as ABSDCT).
e. Motion vector field: in a predicted frame (e.g., a P or B frame), the motion vectors of the macroblocks (and of the sub-blocks) of the predicted frame may be mapped to form a motion vector field. These fields are used for motion vector processing to smooth out outlier (outlier) motion vectors and generally indicate (i) the overall motion in the sequence; (ii) motion activity of various objects in the frames (e.g., based on the intensity, density, and/or magnitude of the motion vector field); and (iii) the number of moving objects in the frame. The level of motion activity also provides an indication of the spatial activity of a particular sequence, since objects (requiring detection of boundaries) and variability (requiring detection of differences between regions) need to be detected by motion activity processing in that sequence.
f. Edge detection: edge detection algorithms in image processing typically apply a high pass filter, such as a Sobel filter, to pixels in an image over a particular window (e.g., a 3 x 3 or 5 x 5 region), and then compare the filtered output to a threshold to determine the presence of an edge. The mapping of the detected edges and the number of edges provides an indication of spatial activity.
g. Other indications of spatial activity exist, as known to those skilled in the art of image processing, and any such metric may be applied in the process shown in fig. 2.
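Two of the metrics above, block variance (item b) and a contrast-style ratio (item d), can be sketched as follows on blocks represented as flat pixel lists. The variance threshold is an assumed stand-in for the data-dependent threshold mentioned in the text:

```python
def block_mean(block):
    """Mean pixel value of a block given as a flat pixel list."""
    return sum(block) / len(block)

def block_variance(block):
    """Pixel variance within one block."""
    m = block_mean(block)
    return sum((p - m) ** 2 for p in block) / len(block)

def contrast_ratio(blocks_3x3):
    """Contrast metric for the centre block of a 3x3 block neighbourhood
    (a list of 9 pixel lists, centre at index 4): standard deviation of
    the centre block divided by the mean of the whole neighbourhood."""
    centre = blocks_3x3[4]
    neigh_mean = block_mean([p for b in blocks_3x3 for p in b])
    return (block_variance(centre) ** 0.5) / neigh_mean if neigh_mean else 0.0

def spatial_activity(blocks, var_threshold=100.0):
    """Fraction of blocks whose variance exceeds an assumed threshold --
    a crude frame-level spatial-activity score in [0, 1]."""
    active = sum(1 for b in blocks if block_variance(b) > var_threshold)
    return active / len(blocks)
```

A flat frame scores 0.0 while a frame of high-contrast checkerboard blocks scores 1.0, matching the intuition that more texture means more spatial activity.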
Temporal activity
In step 212, temporal activity is determined by the content classification module 106. The amount of motion in a video sequence determines the amount of temporal correlation and redundancy in frames of the video sequence that can be used to compress the video sequence. In one embodiment, the temporal activity is quantified by one of the following methods:
a. Motion vector field: This metric uses the same approach as described above for spatial activity in step 210 to construct a motion vector field for the interpolated frame, which is then analyzed.
b. Predicted frame size: the size of the predicted frame is an indication of its entropy, since for a predicted frame the predicted frame size depends on the number of bits needed to encode the predicted motion vectors and residuals. Generally, the greater the amount of motion (or temporal activity), the greater the entropy to be encoded in the predicted frame.
MPEG-7 descriptor: the MPEG-7 Motion Activity Descriptor (MAD) attempts to "capture" the human perception of "activity intensity" or "cadence" of a video sequence. For example, the moment of a goal score in a football game will be perceived by most human viewers as a sequence of "high events". In comparison, a speaker's "head-shoulder" sequence would of course be considered a ' low activity ' sequence by the same viewer. It has been found that MPEG-7 MADs can accurately capture the entire range of activity intensities in natural video. It uses the quantization standard deviation of motion vectors to classify video segments into five categories ranging from very low to very high intensity.
d. Motion activity: the motion activity descriptor is defined as the amount of motion in a video sequence for the problem of active content analysis, indexing, browsing, and querying for motion activity of video data, and is included as a descriptor in the MPEG-7 standard. The proposed technique attempts to automatically measure motion activity using an accumulation of quantized pixel differences between frames of a given video segment. The result is that the motion accumulated for each scene is represented as a two-dimensional matrix. Scalable techniques are also provided that compare these matrices and generate MADs that efficiently represent various motions of each scene. The degree (amount) and position of the movement is calculated and indicated.
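The accumulated quantized pixel-difference idea in item d can be sketched as follows. The quantization step and the five class boundaries are illustrative assumptions, not values from the MPEG-7 standard:

```python
def temporal_activity(frames, qstep=16):
    """Accumulate quantized pixel differences between successive frames
    (each frame a flat pixel list), in the spirit of the motion activity
    measure described above. Returns the mean quantized difference per
    pixel per frame pair."""
    total, count = 0, 0
    for prev, cur in zip(frames, frames[1:]):
        for p, q in zip(prev, cur):
            total += abs(p - q) // qstep  # quantize the difference
            count += 1
    return total / count if count else 0.0

def activity_class(score):
    """Map the score onto five intensity classes, very low .. very high.
    Boundaries are assumed for illustration."""
    for label, bound in [("very low", 0.5), ("low", 1.5),
                         ("medium", 3.0), ("high", 6.0)]:
        if score < bound:
            return label
    return "very high"
```

A static sequence lands in "very low" while large inter-frame changes land in "very high", mirroring the five-category classification mentioned above.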
All of the above spatial and temporal activity metrics are examples only. In other embodiments, any and all of these algorithms may be used together with simple threshold settings to evaluate and score the level of spatial and temporal activity within a frame and between frames.
Spatial-temporal activity
In step 214, the inter-frame correlation is determined by quantifying the absolute spatial activity between adjacent frames, or across a group of frames such as a GOP, and the variation of that spatial activity over the frames, using simple frame differences and/or higher order statistics such as variance and kurtosis.
Alternatively, the contrast ratio principle may be extended to the time domain to provide a spatio-temporal activity metric.
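The frame-difference statistics of step 214 can be sketched as follows: a mean absolute difference per frame pair, then variance and excess kurtosis of those per-pair differences over a group of frames such as a GOP. The flat-pixel-list representation is assumed for illustration:

```python
def frame_difference_stats(frames):
    """Mean absolute frame difference per adjacent pair, then variance
    and excess kurtosis of those per-pair differences across the group.
    High variance/kurtosis suggests bursty motion; low values suggest
    uniform activity that tolerates more dropped frames."""
    diffs = [sum(abs(p - q) for p, q in zip(a, b)) / len(a)
             for a, b in zip(frames, frames[1:])]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    if var == 0:
        return mean, 0.0, 0.0
    kurt = sum((d - mean) ** 4 for d in diffs) / (n * var ** 2) - 3.0
    return mean, var, kurt
```

A uniformly fading sequence yields zero variance (steady activity), while a sudden burst of change yields high variance, signalling uneven motion across the group.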
Determining redundancy
The spatial activity metric determined in step 210 is combined with the temporal or motion activity metric determined in step 212 to determine the overall spatio-temporal activity for a given sequence. For example, in hybrid video compression, the first frame in a sequence (e.g., the first frame after any access point or scene cut) is typically encoded independently of any temporal prediction. This first frame is referred to as an I-frame. Subsequent frames in the sequence are predicted primarily from the I-frame or other previous frames, which as mentioned earlier are referred to as P or B frames. In one embodiment, the redundancy between a reference frame and a predicted frame in an original sequence of video may be determined using:
a. Correlation: A binary correlation may be performed between (1) the pixels of (i) one or more macroblocks, (ii) other basic units of a frame, or (iii) the entire predicted frame and (2) the co-located units in the reference frame. This is a computationally expensive operation, but it is also an accurate estimate of redundancy.
b. Motion vector: the magnitude and correlation of motion vectors in the neighborhood of macroblocks and in the entire frame are compared between the reference frame and the predicted frame. Motion vector smoothing or other motion vector processing may then be applied to determine motion vector variance, or to classify motion fields based on activity.
c. Importance: Each macroblock, or a window of macroblocks, is then rated as having a low, medium, or high level of redundancy. Low redundancy blocks are encoded as B-frame data using bi-directional prediction. For medium redundancy blocks, the decoder is provided with one or more of the following: motion vectors to refine the motion vector processing results in decoder FRUC, residual information to refine differences in texture, luminance shift information in the form of a DC offset, etc. High redundancy blocks are those sufficiently correlated with the corresponding blocks in the FRUC interpolated frame that they can be skipped.
All the above pieces of information relate to a macroblock or a 3 × 3 macroblock window, and are referred to as side information.
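The low/medium/high importance rating in item c might be sketched as follows, using mean absolute pixel difference between a macroblock and its FRUC-interpolated counterpart as the redundancy measure. The thresholds are assumed stand-ins for the content-dependent ones described above:

```python
def mb_redundancy(orig_mb, fruc_mb, lo_thr=4.0, hi_thr=16.0):
    """Rate one macroblock's redundancy against its FRUC-interpolated
    counterpart by mean absolute pixel difference; the thresholds are
    illustrative, not from the text."""
    mad = sum(abs(a - b) for a, b in zip(orig_mb, fruc_mb)) / len(orig_mb)
    if mad <= lo_thr:
        return "high"    # highly redundant: skip, FRUC reconstruction suffices
    if mad <= hi_thr:
        return "medium"  # send side info: MVs, residual, DC offset
    return "low"         # bi-directionally predict (B-frame coding)
```

Blocks rated "medium" here are the ones for which the side information described above (motion vectors, residual, DC offset) would be transmitted.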
Selecting frames for FRUC
Once the amount of redundancy in the original video has been determined, the content is classified. In one embodiment, a variety of sampled raw data is used to establish the classification parameters for a particular application, so the encoding system 100 can be tuned to the particular content an implementation is intended to support. The classification mechanism uses the predicted frame size from normal hybrid coding. In one embodiment, the smaller the predicted frame size and the larger the redundancy factor, the greater the likelihood that the frame will be skipped during the encoding process. Such frames are then not included in the transmitted video sequence but are instead up-converted during the decoding/FRUC process.
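A toy version of this selection rule, combining predicted frame size with a redundancy factor, is sketched below. The equal weighting, the normalization constant, and the skip threshold are all illustrative assumptions:

```python
def skip_score(pred_frame_bits, redundancy, max_bits=20000):
    """Illustrative skip likelihood in [0, 1]: a small predicted-frame
    size and a high redundancy factor both push toward dropping the
    frame and leaving it to decoder FRUC."""
    size_term = 1.0 - min(pred_frame_bits / max_bits, 1.0)
    return 0.5 * size_term + 0.5 * redundancy  # equal weights, assumed

def select_frames(frames, threshold=0.75):
    """frames: list of (name, predicted_bits, redundancy in [0, 1]).
    Returns (transmit, dropped) name lists."""
    transmit, dropped = [], []
    for name, bits, red in frames:
        target = dropped if skip_score(bits, red) >= threshold else transmit
        target.append(name)
    return transmit, dropped
```

A small, highly redundant predicted frame is dropped for up-conversion at the decoder; a large, low-redundancy frame is transmitted normally.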
As shown in fig. 3, applying the above mechanism in a low complexity one-pass encoding is useful for applications such as mobile camera applications, where delay and processor power prevent the device from completing a second pass after fully or partially encoding the first. However, when encoder complexity is not a concern, as in an Internet or wireless multimedia server implementation, normal hybrid encoding may be performed in a first pass and the spatial, temporal, and spatio-temporal activity determined in a second pass, as shown in fig. 4. In one embodiment, based on the predicted frame sizes of a video sequence (e.g., the P and B frame sizes) and the frame characteristics (e.g., the ratio of bits used for motion vectors to bits used for coefficient data), low cost frames (e.g., frames with a low transmission cost) may be dropped at the encoder 104 and reconstructed at the decoder 154 using decoded reference frames. In another embodiment, the encoder 104 may encode and transmit a small amount of additional information (low entropy) to "assist" the up-conversion process of the decoder 154, as described below. Although the purpose of this assistance is primarily to enhance the quality of the reconstructed video, it can also reduce the computational load of the decoder 154 by helping the FRUC engine 158 of the decoder 154 make correct decisions during the mode decision process.
Entropy coding of information between original and FRUC interpolated frames
As mentioned herein, one of the main advantages of EA-FRUC is that the original frame corresponding to the interpolated frame is available at the encoder. Therefore, FRUC decisions can be guided so as to minimize the error between the original and reconstructed frames. For example, the FRUC processing approaches described herein rely on motion vector processing and on content identification and assignment. In these processes, the interpolation of occlusion and overlap regions is a challenge. However, using entropy coding performed by the entropy determination module 108 in fig. 1, these regions can be identified and appropriate boundary information transmitted to the decoder 154 to assist the FRUC process. Another application of such entropy coding is in scalable video coding applications of FRUC processing, as described in co-pending patent application No. 11/173,121, entitled "Method and Apparatus for Using Frame Rate Up Conversion Techniques in Scalable Video Coding". In one embodiment, the entropy determination module 108 may use the following entropy coding metrics:
1. Pixel difference data: The pixel residuals between the reconstructed FRUC frame and the original frame are transformed, quantized, and entropy coded for transmission. This method is simple; however, the residual from this process contains relatively high energy and does not compress well.
2. Thresholding: Thresholds are based on activity (spatial and temporal) metrics, or on human visual system masking and sensitivity, rather than on SAD. The Human Visual System (HVS) model is an empirical model that accounts for the sensitivity of the human eye to various visual effects such as color, brightness, and contrast, while SAD is known to minimize error in the mean square sense rather than reduce error in terms of visual quality.
3. Motion vectors: Corrected motion vector data for regions that differ significantly from the original frame are encoded and transmitted. These motion vectors are estimated in a causal or non-causal manner using the original frame and the reconstructed reference frames. Causal coding is predictive coding that uses information available at the moment of encoding/decoding (e.g., information from the previous macroblock in decoding order), while non-causal coding is interpolative coding that uses interpolation information (e.g., information from the next macroblock).
B frame coding: in co-pending patent application No. _______ [040442], entitled "Method and Apparatus for Using Frame Rate up Conversion technical in Scalable Video Coding", the use of FRUC-interpolated frames as one of the reference frames during B-Frame prediction is described. This approach may provide a 30% reduction in transmitted texture data on average.
5. Mode-based: the B-frame encoding method above illustrates the use of an interpolated frame as a reference frame in encoding a B frame. The decision to use the interpolated frame may be based on rate (i.e., minimizing bitrate for a given distortion), distortion (i.e., minimizing distortion for a given target bitrate), and/or quality (i.e., maximizing, for a given bitrate, a quality metric that measures perceptual quality based on the HVS or on mean square error).
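As an illustration of the mode-based decision above, the choice among candidate macroblock modes (including one that references a FRUC-interpolated frame) can be framed as minimizing a Lagrangian rate-distortion cost. The sketch below is a minimal illustration, not the patented method; the mode names, distortion values, and bit counts are hypothetical.

```python
def choose_mode(candidates, lam):
    """Return the candidate minimizing the Lagrangian cost D + lam * R.

    candidates: list of (name, distortion, rate_bits) tuples; the
    distortion and rate figures would come from trial-encoding each mode.
    lam: Lagrange multiplier trading distortion against rate.
    """
    return min(candidates, key=lambda m: m[1] + lam * m[2])

# Hypothetical per-macroblock measurements: (mode, SSD distortion, bits).
# "fruc_ref" stands for predicting from the FRUC-interpolated reference.
modes = [("skip", 900.0, 1), ("fruc_ref", 400.0, 12), ("intra", 150.0, 96)]
best = choose_mode(modes, lam=10.0)  # costs: 910.0, 520.0, 1110.0
```

With lam = 10 the FRUC-referenced mode wins; raising lam penalizes rate more heavily and pushes the decision toward the cheap skip mode, while lowering it toward the high-fidelity intra mode.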
Once the entropy of the information to be encoded has been determined, the data may, in one embodiment, be encoded using conventional Huffman-style variable length coding or arithmetic coding. Furthermore, for data with a Laplacian distribution, such as residuals, Golomb-Rice or Exp-Golomb coding may be applied.
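For instance, the zeroth-order Exp-Golomb code mentioned above (the variable-length code H.264 uses for many syntax elements) can be sketched in a few lines of Python; this is a generic illustration of the code construction, not the encoder's actual entropy-coding path.

```python
def ue(n: int) -> str:
    """Unsigned Exp-Golomb: (len-1) leading zeros followed by binary(n+1)."""
    if n < 0:
        raise ValueError("ue() expects a non-negative integer")
    b = bin(n + 1)[2:]               # binary of n+1 with no leading zeros
    return "0" * (len(b) - 1) + b

def se(v: int) -> str:
    """Signed Exp-Golomb: map 0, 1, -1, 2, -2, ... to 0, 1, 2, 3, 4, ...
    and encode the result with ue()."""
    return ue(2 * v - 1 if v > 0 else -2 * v)

codes = [ue(n) for n in range(5)]    # ['1', '010', '011', '00100', '00101']
```

Small values get short codewords, which is exactly the property that suits the peaked, Laplacian-like distribution of prediction residuals and motion vector differences.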
Bitstream generation
Video coding standards define a bitstream that can be decoded by any decoder compliant with the standard. The encoder, however, is "open" in this regard: any encoding technique may be used as long as the encoded bitstream can be reconstructed by a standard-compliant decoder. In scalable applications, where decoder performance is unknown, the encoder must generate standard-compliant bitstreams that are targeted and optimized for proper decoding. In one embodiment, the bitstream generation module 112 in the encoding system 100 controls the operation of the standard compatible bitstream generator 114, the standard incompatible bitstream generator 116, and the proprietary bitstream generator 118, each of which is described below.
Classes (profiles) and levels are defined in video coding standards because the standards provide a large set of tools for coding audiovisual objects, and, in order to implement a standard efficiently, a subset of that tool set is selected for a particular application. These subsets are called "classes," and they limit the number of tools in the tool set that a decoder must implement. In addition, for each of these classes, one or more levels are set that bound the computational complexity.
Standard and class compatibility
Video decoders conform to a particular class and level so that standard-compliant decoders in receivers can decode transmitted streams, such as streams transmitted in wireless multimedia communications. While FRUC algorithms are presented as an appendix in various standards, they are typically not part of any class of a standard. Therefore, it is desirable to implement EA-FRUC without having to modify the bitstream syntax and/or semantics.
To comply with existing standards, the encoding system 100 uses the syntax of a compatible standard (compatible class) that can be used to transmit "auxiliary" information. In one embodiment, the standard-compliant generator 114 may use the standard syntax in the following ways to implement EA-FRUC processing:
a. B-frame syntax: when the B frame is not received (either because it is part of an enhancement layer and only the base layer is received, or because most of the redundant macroblocks are skip-mode macroblocks and the entire B frame is not transmitted), side information can still be conveyed through ordinary B-frame coding.
b. Redundant slices or pictures: H.264 provides such a syntax. In the case where a frame is largely redundant, it is not necessary to transmit the entire slice or frame; portions of slices (the few important macroblocks) or portions of frames (the few slices determined to be important) are sent using this syntax. This feature is part of all the classes defined in H.264.
c. Supplemental Enhancement Information (SEI): specific SEI fields are part of the classes of H.264 and can be used to transmit "auxiliary" information.
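To make item (c) concrete, the sketch below assembles a simplified SEI NAL unit of type user_data_unregistered (payloadType 5), the H.264 mechanism for carrying opaque "auxiliary" bytes past a class-compliant decoder. It is a sketch under simplifying assumptions: emulation-prevention bytes (the 0x000003 escapes) and start codes are omitted.

```python
def sei_user_data_unregistered(uuid16: bytes, payload: bytes) -> bytes:
    """Build a bare SEI NAL unit carrying user_data_unregistered bytes.

    A compliant decoder must be able to parse this SEI message and is
    free to ignore it; an EA-FRUC-aware decoder could instead read FRUC
    hints from the payload.
    """
    if len(uuid16) != 16:
        raise ValueError("user_data_unregistered begins with a 16-byte UUID")
    body = uuid16 + payload
    out = bytearray([0x06, 0x05])    # NAL header (nal_unit_type 6 = SEI), payloadType 5
    size = len(body)
    while size >= 255:               # payloadSize is coded with 0xFF continuation bytes
        out.append(0xFF)
        size -= 255
    out.append(size)
    out += body
    out.append(0x80)                 # rbsp_trailing_bits: stop bit plus alignment
    return bytes(out)

nal = sei_user_data_unregistered(bytes(16), b"fruc-hints")
```

The 16-byte UUID lets the receiver distinguish this producer's hints from any other application's unregistered user data.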
Standard compatibility, class incompatibility
The tool sets of many video coding standards include syntax and semantics for carrying private data that is not compatible with the classes defined in the standard (i.e., such tools are informative rather than normative). The interpretation of the parsed private data may be specific to the destination device, and this feature can be exploited in a closed communication system to improve performance. In one application of this feature, in one embodiment of the invention, the non-standard-compliant bitstream generator 116 uses this non-class-compliant private data to carry "auxiliary" information for FRUC. In a closed-loop system, the use of private information provides more flexibility in the transmission of "auxiliary" information, since the decoder modifications required to use the private data are minimal and can be achieved by simple "insertions" or "additions":
a. Dedicated SEI fields: these fields are not part of any class in H.264 and can be used to transmit "auxiliary" information.
b. User data: MPEG-2 and MPEG-4 provide syntax for carrying private data, which can be used to transmit "auxiliary" information.
Proprietary codecs
Providing a non-standard-compliant proprietary codec in the proprietary bitstream generator 118 increases the flexibility of the EA-FRUC approach provided herein. In particular, any or all video compression techniques (e.g., those based on DCT, integer, Hadamard, or wavelet transforms, or on object, optical-flow, or morphing representations) may employ general video interpolation algorithms to achieve the bit rate reductions and compression efficiency improvements described above with respect to EA-FRUC. Advantages of using a proprietary codec include the following. The proprietary bitstream generator 118 provides an extremely flexible platform for exploiting all FRUC and EA-FRUC algorithms. The substantial bitstream overhead introduced by each standard (e.g., macroblock headers in H.264 can account for 25% of the total bit rate) may be mitigated and/or eliminated. Joint source-channel coding also becomes possible, which is a great advantage for multimedia communication over error-prone channels. For example, proprietary methods that exploit joint source and transmission channel probabilities, distributions, and characteristics give the coding system 100 the ability to prioritize particular streams and to add the parameters and data required to recover gracefully from errors.
FRUC and EA-FRUC for error concealment
The increasing popularity of wireless multimedia requires that transmitted video have some resilience to errors and that intelligent video decoders be able to conceal bit, packet, and burst errors. Video compression removes redundancy and increases the information entropy of the compressed stream. Ironically, however, this removal of redundancy and increase in entropy is so thorough that the loss of even a single bit, byte, or packet of data can affect the quality of the reconstructed video: the damage ranges from the loss of one block to the loss of many macroblocks or slices, and can propagate through the entire current GOP until the next I frame or Instantaneous Decoding Refresh (IDR) frame is correctly received. An IDR picture (an H.264 term) is an absolute refresh point (access unit) in the bitstream, such that no prediction information from before the access unit is needed to decode the IDR picture. In applications such as video telephony, video conferencing, and video mail, errors have serious consequences and error concealment is crucial; errors can also affect latency in conversational applications. Fortunately, the frame, slice, macroblock, and block interpolation and interpolation-assistance algorithms (e.g., motion vector assignment and motion vector processing) provided in the various forms of FRUC can be used for error concealment.
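As a toy illustration of FRUC-style concealment, the sketch below hides a lost frame by zero-motion temporal averaging of its neighbors, and a lost slice by copying co-located samples from the previous frame. Frames are flat lists of luma samples, a deliberate simplification; a real concealer would use motion-compensated interpolation as described above.

```python
def conceal_frame(prev, nxt):
    """Replace a lost frame with the per-sample average of its neighbors,
    i.e. the zero-motion special case of FRUC temporal interpolation."""
    return [(a + b) // 2 for a, b in zip(prev, nxt)]

def conceal_run(frame, prev, start, length):
    """Patch a lost run of samples (e.g. a slice hit by a packet erasure)
    with the co-located samples of the previous decoded frame."""
    out = list(frame)
    out[start:start + length] = prev[start:start + length]
    return out

lost = conceal_frame([10, 20, 30], [20, 40, 30])            # -> [15, 30, 30]
patched = conceal_run([7, -1, -1, 7], [9, 9, 9, 9], 1, 2)   # -> [7, 9, 9, 7]
```

Both operations use only data already present at the decoder, which is why the same interpolation machinery built for FRUC doubles as an error-concealment tool.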
Fig. 5 shows a block diagram of an access terminal 502x and an access point 504x in a wireless-system application of the EA-FRUC system 100, in which the decoder 154 and the encoder 104 may respectively be located. For the reverse link, at access terminal 502x, a transmit (TX) data processor 514 receives traffic data from a data buffer 512, processes (e.g., encodes, interleaves, and symbol maps) each data packet based on a selected coding and modulation scheme, and provides data symbols. The data symbols are modulation symbols for data, and the pilot symbols are modulation symbols for pilot (which are known a priori). A modulator 516 receives the data symbols, pilot symbols, and possibly signaling for the reverse link, performs (e.g., OFDM) modulation and/or other processing as specified by the system, and provides a stream of output chips. A transmitter unit (TMTR) 518 processes (e.g., converts to analog, filters, amplifies, and frequency upconverts) the output chip stream and generates a modulated signal, which is transmitted from an antenna 520.
At access point 504x, an antenna 552 receives the modulated signals transmitted by access terminal 502x and other terminals in communication with access point 504x. A receiver unit (RCVR) 554 processes (e.g., conditions and digitizes) the received signal from antenna 552 and provides received samples. A demodulator (Demod) 556 processes (e.g., demodulates and detects) the received samples and provides detected data symbols, which are noisy estimates of the data symbols transmitted by the terminals to access point 504x. A receive (RX) data processor 558 processes (e.g., symbol demaps, deinterleaves, and decodes) the detected data symbols for each terminal and provides decoded data for that terminal.
For the forward link, at access point 504x, traffic data is processed by a TX data processor 560 to generate data symbols. A modulator 562 receives the data symbols, pilot symbols, and signaling for the forward link, performs (e.g., OFDM) modulation and/or other pertinent processing, and provides an output chip stream, which is further conditioned by a transmitter unit 564 and transmitted from antenna 552. The forward link signaling may include power control commands generated by controller 570 for all terminals transmitting on the reverse link to access point 504x. At access terminal 502x, the modulated signal transmitted by access point 504x is received by antenna 520, conditioned and digitized by a receiver unit 522, and processed by a demodulator 524 to obtain detected data symbols. An RX data processor 526 processes the detected data symbols and provides decoded data and forward link signaling for the terminal. Controller 530 receives the power control commands and controls data transmission and transmit power on the reverse link to access point 504x. Controllers 530 and 570 direct the operation of access terminal 502x and access point 504x, respectively. Memory units 532 and 572 store program codes and data used by controllers 530 and 570, respectively.
As discussed herein, an "access terminal" refers to a device that provides voice and/or data connectivity to a user. The access terminal may be connected to a computing device such as a laptop computer or desktop computer, or it may be a self-contained device such as a personal digital assistant. An access terminal may also be called a subscriber unit, mobile station, mobile, remote station, remote terminal, user agent, or user equipment. An access terminal may be a subscriber station, wireless device, cellular telephone, PCS telephone, cordless telephone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), a handheld device having wireless connection capability, or other processing device connected to a wireless modem.
As discussed herein, an "access point" refers to a device in an access network that communicates over the air-interface, through one or more sectors, with access terminals. The access point acts as a router between the access terminal and the rest of the access network, which may include an IP network, and converts received air-interface frames into IP packets. The access point also coordinates management of the air interface attributes.
The disclosed embodiments may be applied to any one of the following technologies and combinations thereof: code Division Multiple Access (CDMA) systems, multi-carrier CDMA (MC-CDMA), wideband CDMA (W-CDMA), High Speed Downlink Packet Access (HSDPA), Time Division Multiple Access (TDMA) systems, Frequency Division Multiple Access (FDMA) systems, and Orthogonal Frequency Division Multiple Access (OFDMA) systems.
It should be noted that the methods described herein may be implemented on a variety of communication hardware, processors, and systems known to those of ordinary skill in the art. In general, a client performing the operations described herein requires a display to present content and information, a processor to control the operation of the client, and a memory to store data and programs related to that operation. In one embodiment, the client is a cellular telephone. In another embodiment, the client is a handheld computer with communication capabilities. In yet another embodiment, the client is a personal computer with communication capabilities.
The various illustrative logics, logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
The description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the various embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments, e.g., in an instant messaging service or any general wireless data communication application, without departing from the spirit or scope of the invention. Thus, the description is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. The word "exemplary" is used exclusively herein to mean "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Claims (36)
1. A method for constructing a video sequence comprising a sequence of frames, comprising:
determining an amount of a type of activity in the sequence of frames, the type of activity being selected from the group consisting of spatial activity, temporal activity, and spatio-temporal activity;
determining a redundancy in the activity; and
encoding the frame if the determined redundancy is below a predetermined threshold.
2. The method of claim 1, wherein determining the spatial activity in the sequence of frames comprises: an amount of texture information in at least one frame of the sequence of frames is determined.
3. The method of claim 1, wherein determining the temporal activity in the sequence of frames comprises: an amount of temporal correlation and redundancy between at least two frames in the sequence of frames is determined.
4. The method of claim 1, wherein determining the spatio-temporal activity in frames of the sequence of frames comprises: an amount of temporal correlation and redundancy of an amount of texture information between at least two frames in the sequence of frames is determined.
5. The method of claim 1, wherein determining the redundancy in the activity comprises: the redundancy is determined using at least one spatial activity metric selected from the group consisting of a contrast ratio metric, a spatial complexity metric, and a variance metric.
6. The method of claim 1, wherein determining the redundancy in the activity comprises: the redundancy is determined using at least one temporal activity metric selected from the group consisting of a motion field strength metric, a temporal complexity metric, and a sum of absolute differences metric.
7. The method of claim 1, wherein determining the redundancy in the activity comprises: determining redundancy by comparing at least two activity metrics selected from the group consisting of: correlation of spatial activity metrics between multiple adjacent frames, directionality metrics, joint behavior between regions with different spatial activity metrics, motion field strength metrics, temporal complexity metrics, and sum of absolute difference metrics.
8. A method for determining difference information between two frames, comprising: determining one difference metric selected from the group consisting of a pixel difference metric, a motion information difference metric, a mode decision threshold metric, and an interpolated frame refinement metric, wherein said determining of said difference metric is performed using a frame rate up-conversion process.
9. A method for encoding difference information, comprising: encoding the difference information using at least one technique selected from the group consisting of a motion compensation process, a motion vector transformation process, a motion vector quantization process, and an entropy coding process, wherein the at least one technique is specified in a video coding standard such that a standard-compliant processor is capable of processing the difference information in conjunction with a frame rate up-conversion process to produce a video frame.
10. A method for processing a video bitstream, the video bitstream having difference information contained therein, the method comprising:
encoding difference information in the video bitstream using an entropy encoding technique selected from the group consisting of a variable length coding technique, a Huffman coding technique, and an arithmetic coding technique; and
the encoded information is carried in a user data syntax specified in the video coding standard.
11. The method of claim 10, further comprising: a standard-compliant video bitstream is generated.
12. A method for processing a video bitstream having encoded difference information therein, the encoded difference information being stored in a user data syntax, the method comprising:
extracting the encoded difference information from the user data syntax;
decoding the difference information; and
the decoded difference information is used in a frame rate up-conversion process to generate a video frame.
13. A computer-readable medium having stored thereon instructions for causing a computer to perform a method of constructing a video sequence comprising a sequence of frames, the method comprising:
determining an amount of a type of activity in the sequence of frames, the type of activity being selected from the group consisting of spatial activity, temporal activity, and spatio-temporal activity;
determining a redundancy in the activity; and
encoding the frame if the determined redundancy is below a predetermined threshold.
14. The computer-readable medium of claim 13, wherein determining the spatial activity in the sequence of frames comprises: an amount of texture information in at least one frame of the sequence of frames is determined.
15. The computer-readable medium of claim 13, wherein determining the temporal activity in the sequence of frames comprises: an amount of temporal correlation and redundancy between at least two frames of the sequence of frames is determined.
16. The computer-readable medium of claim 13, wherein determining the spatio-temporal activity in frames of the sequence of frames comprises: an amount of temporal correlation and redundancy of an amount of texture information between at least two frames in the sequence of frames is determined.
17. The computer-readable medium of claim 13, wherein determining the redundancy in the activity comprises: the redundancy is determined using at least one spatial activity metric selected from the group consisting of a contrast ratio metric, a spatial complexity metric, and a variance metric.
18. The computer-readable medium of claim 13, wherein determining the redundancy in the activity comprises: the redundancy is determined using at least one temporal activity metric selected from the group consisting of a motion field strength metric, a temporal complexity metric, and a sum of absolute differences metric.
19. The computer-readable medium of claim 13, wherein determining the redundancy in the activity comprises: determining redundancy by comparing at least two activity metrics selected from the group consisting of: correlation of spatial activity metrics between multiple adjacent frames, directionality metrics, joint behavior between regions with different spatial activity metrics, motion field strength metrics, temporal complexity metrics, and sum of absolute difference metrics.
20. A computer-readable medium having instructions stored thereon for causing a computer to perform a method for determining difference information between two frames, the method comprising: determining a difference metric selected from the group consisting of a pixel difference metric, a motion information difference metric, a mode decision threshold metric, and an interpolated frame refinement metric, wherein said determination of said difference metric is made using a frame rate up-conversion process.
21. A computer-readable medium having instructions stored thereon for causing a computer to perform a method for encoding difference information, the method comprising: encoding the difference information using at least one technique selected from the group consisting of a motion compensation process, a motion vector transform process, a motion vector quantization process, and an entropy coding process, wherein the at least one technique is specified in a video coding standard such that a standard-compliant processor is capable of processing the difference information in conjunction with a frame rate up-conversion process to produce a video frame.
22. A computer-readable medium having instructions stored thereon for causing a computer to perform a method of processing a video bitstream, the video bitstream having difference information contained therein, the method comprising:
encoding difference information in the video bitstream using an entropy encoding technique selected from the group consisting of a variable length coding technique, a Huffman coding technique, and an arithmetic coding technique; and
the encoded information is carried in a user data syntax specified in the video coding standard.
23. The computer-readable medium of claim 22, further comprising: a standard-compliant video bitstream is generated.
24. A computer-readable medium having stored thereon instructions for causing a computer to perform a method for processing a video bitstream having encoded difference information therein, the encoded difference information being stored in a user data syntax, the method comprising:
extracting the encoded difference information from the user data syntax;
decoding the difference information; and
video frames are generated using a frame rate up-conversion process.
25. An apparatus for constructing a video sequence comprising a sequence of frames, comprising:
means for determining an amount of a type of activity in the sequence of frames, the type of activity being selected from the group consisting of spatial activity, temporal activity, and spatio-temporal activity;
means for determining redundancy in the activity; and
means for encoding the frame if the determined redundancy is below a predetermined threshold.
26. The apparatus of claim 25, wherein means for determining the spatial activity in the sequence of frames comprises: means for determining an amount of texture information in at least one frame of the sequence of frames.
27. The apparatus of claim 25, wherein means for determining the temporal activity in the sequence of frames comprises: means for determining an amount of temporal correlation and redundancy between at least two frames in the sequence of frames.
28. The apparatus of claim 25, wherein the means for determining spatio-temporal activity in frames of the sequence of frames comprises: means for determining an amount of temporal correlation and redundancy of an amount of texture information between at least two frames in the sequence of frames.
29. The apparatus of claim 25, wherein the means for determining the redundancy in the activity comprises: means for determining redundancy using at least one spatial activity metric selected from the group consisting of a contrast ratio metric, a spatial complexity metric, and a variance metric.
30. The apparatus of claim 25, wherein the means for determining the redundancy in the activity comprises: means for determining the redundancy using at least one temporal activity metric selected from the group consisting of a motion field strength metric, a temporal complexity metric, and a sum of absolute differences metric.
31. The apparatus of claim 25, wherein the means for determining the redundancy in the activity comprises: means for determining the redundancy by comparing at least two activity metrics selected from the group consisting of: correlation of spatial activity metrics between multiple adjacent frames, directionality metrics, joint behavior between regions with different spatial activity metrics, motion field strength metrics, temporal complexity metrics, and sum of absolute difference metrics.
32. An apparatus for determining difference information between two frames, comprising: means for determining a difference metric selected from the group consisting of a pixel difference metric, a motion information difference metric, a mode decision threshold metric, and an interpolated frame refinement metric, wherein the determination of the difference metric is performed using a frame rate up-conversion process.
33. An apparatus for encoding difference information, comprising: means for encoding the difference information using at least one technique selected from the group consisting of a motion compensation process, a motion vector transformation process, a motion vector quantization process, and an entropy coding process, wherein the at least one technique is specified in a video coding standard such that a standard-compliant processor is capable of processing the difference information in conjunction with a frame rate up-conversion process to produce a video frame.
34. An apparatus for processing a video bitstream, the video bitstream having difference information contained therein, comprising:
means for encoding difference information in the video bitstream using an entropy encoding technique selected from the group consisting of a variable length coding technique, a Huffman coding technique, and an arithmetic coding technique; and
means for carrying the encoded information in a user data syntax specified in a video coding standard.
35. The apparatus of claim 34, further comprising: means for generating a standard-compliant video bitstream.
36. An apparatus for processing a video bitstream having encoded difference information therein, the encoded difference information being stored in a user data syntax, comprising:
means for extracting the encoded difference information from the user data syntax;
means for decoding the difference information; and
means for generating a video frame using the decoded difference information in a frame rate up-conversion process.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US60/589,901 | 2004-07-20 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1114724A true HK1114724A (en) | 2008-11-07 |