
HK1108309A - Motion estimation techniques for video encoding - Google Patents

Motion estimation techniques for video encoding

Info

Publication number
HK1108309A
HK1108309A (application number HK08102272.7A)
Authority
HK
Hong Kong
Prior art keywords
video block
motion vector
motion
video
current video
Prior art date
Application number
HK08102272.7A
Other languages
Chinese (zh)
Inventor
沙拉特‧曼朱纳特
李向川
纳伦德拉纳特‧马拉亚特
Original Assignee
高通股份有限公司 (Qualcomm Incorporated)
Priority date
Filing date
Publication date
Application filed by 高通股份有限公司 (Qualcomm Incorporated)
Publication of HK1108309A


Abstract

This disclosure describes video encoding techniques and video encoding devices that implement such techniques. In one embodiment, this disclosure describes a video encoding device comprising a motion estimator that computes a motion vector predictor based on motion vectors previously calculated for video blocks in proximity to a current video block to be encoded, and uses the motion vector predictor in searching for a prediction video block used to encode the current video block, and a motion compensator that generates a difference block indicative of differences between the current video block to be encoded and the prediction video block.

Description

Motion estimation techniques for video coding
Technical Field
This disclosure relates to digital video processing, and more particularly, to video sequence encoding.
Background
Digital video capabilities can be incorporated into a wide variety of devices, including digital televisions, digital direct broadcast systems, wireless communication devices, Personal Digital Assistants (PDAs), laptop computers, desktop computers, digital cameras, digital recording devices, cellular or satellite radio telephones, and the like. Digital video devices can provide significant improvements over conventional analog video systems in creating, modifying, transmitting, storing, recording, and playing full motion video sequences.
Many different video coding standards have been established for encoding digital video sequences. For example, the Moving Picture Experts Group (MPEG) has developed a number of standards, including MPEG-1, MPEG-2, and MPEG-4. Other standards include the H.263 standard developed by the International Telecommunication Union (ITU), QuickTime™ technology developed by Apple Computer of Cupertino, California, Video for Windows™ developed by Microsoft Corporation of Redmond, Washington, Indeo™ developed by Intel Corporation, RealVideo™ from RealNetworks, Inc. of Seattle, Washington, and Cinepak™ developed by SuperMac, Inc. New standards are continually emerging and evolving, including the ITU H.264 standard and many proprietary standards.
Various video coding standards achieve improved video sequence transmission rates by encoding data in compressed form. Compression can reduce the overall amount of data that needs to be transmitted for effective transmission of video frames. For example, most video coding standards utilize graphics and video compression techniques designed to facilitate transmission over narrower bandwidths of video and images, which bandwidths are narrower than achievable without compression.
For example, the MPEG standards and the ITU H.263 and ITU H.264 standards support video coding techniques that utilize similarities between successive video frames, referred to as temporal or inter-frame correlation, to provide inter-frame compression. Inter-frame compression techniques exploit data redundancy across frames by converting a pixel-based representation of a video frame into a motion representation. In addition, certain video coding techniques may take advantage of similarities within frames (referred to as spatial or intra-frame correlation) to further compress video frames.
To support compression, a digital video device includes an encoder for compressing a digital video sequence and a decoder for decompressing the digital video sequence. In many cases, the encoder and decoder form an integrated encoder/decoder (CODEC) that operates on blocks of pixels within multiple frames that define a sequence of video images. For example, in the MPEG-4 standard, an encoder typically divides a video frame to be transmitted into video blocks referred to as "macroblocks," which may comprise a 16 × 16 array of pixels. The ITU H.264 standard supports 16 × 16 video blocks, 16 × 8 video blocks, 8 × 16 video blocks, 8 × 8 video blocks, 8 × 4 video blocks, 4 × 8 video blocks, and 4 × 4 video blocks.
For each video block in a video frame, the encoder searches similarly sized video blocks of one or more immediately preceding video frames (or subsequent frames) to identify the most similar video block (referred to as the "best prediction"). The process of comparing the current video block with the video blocks of other frames is commonly referred to as motion estimation. Once the "best prediction" is identified for a video block, the encoder may encode the difference between the current video block and the best prediction. This process of encoding the difference between the current video block and the best prediction includes a process known as motion compensation. Motion compensation includes a process that creates a difference block that indicates the difference between the current video block to be encoded and the best prediction. Motion compensation generally refers to the act of taking the best prediction block using a motion vector and then subtracting the best prediction from the input block to produce a difference block.
After motion compensation has created the difference block, a series of additional encoding steps are typically performed to encode the difference block. These additional encoding steps may depend on the encoding standard being used. For example, in an MPEG-4 compliant encoder, additional encoding steps may include an 8 x 8 discrete cosine transform, followed by scalar quantization, followed by raster-to-zigzag reordering, followed by run length encoding, followed by Huffman encoding. The encoded difference block may be transmitted along with a motion vector that indicates which video block in the previous frame was used for encoding. A decoder receives the motion vector and encoded difference block, and decodes the received information to reconstruct a video sequence.
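As a concrete illustration of the raster-to-zigzag reordering step, the sketch below (in Python, purely illustrative; the patent provides no code) generates the conventional zigzag scan order for a square block by walking its anti-diagonals in alternating directions:

```python
def zigzag_order(n=8):
    """Return (row, col) pairs in the standard zigzag scan order for an n x n block."""
    order = []
    for d in range(2 * n - 1):            # walk each anti-diagonal
        cells = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:
            cells.reverse()               # even diagonals run bottom-left to top-right
        order.extend(cells)
    return order

def zigzag_scan(block):
    """Reorder a square block (list of pixel rows) into a 1-D zigzag sequence."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Applied to an 8 × 8 block of quantized transform coefficients, this reordering groups low-frequency coefficients at the front of the sequence, which tends to produce the long runs of zeros that the subsequent run length encoding step exploits.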
It is highly desirable to simplify and improve the encoding process. For this purpose, various coding techniques have been developed. Because motion estimation is one of the most computationally intensive processes in video coding, improvements to motion estimation may provide significant improvements in the video coding process.
Disclosure of Invention
This disclosure describes a number of motion estimation techniques that may improve video coding. In particular, this disclosure proposes various non-conventional uses of Motion Vector Predictors (MVPs), which are early estimates of the desired motion vector, and are typically computed based on previously computed motion vectors for neighboring video blocks. In some techniques, this disclosure proposes to use motion vector predictors to compute distortion measures, which quantify the cost of the motion vectors relative to other motion vectors. In other techniques, motion vector predictors may be used to define a search for a prediction video block used to encode a current video block. Various other techniques are also described, such as techniques that use searches in stages at different spatial resolutions, which may speed up the encoding process without significantly degrading performance.
In one embodiment, this disclosure describes a method comprising calculating a motion vector predictor based on a motion vector previously calculated for a video block proximate to a current video block to be encoded, and searching a prediction video block used to encode the current video block using the motion vector predictor.
In another embodiment, this disclosure describes a method comprising identifying a motion vector for a prediction video block used to encode a current video block, the identifying comprising calculating a distortion measure that depends at least in part on an amount of data associated with a different motion vector, and the method also comprises generating a difference block that indicates differences between the current video block to be encoded and the prediction video block.
These and other techniques described herein may be implemented in a digital video device in hardware, software, firmware, or any combination thereof. If implemented in software, the techniques may be directed to a computer readable medium comprising program code that, when executed, performs one or more of the encoding techniques described herein. Additional details of various embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a block diagram illustrating an example system in which a source digital video device transmits an encoded sequence of video data to a receiving digital video device.
Fig. 2 is an exemplary block diagram of a digital video device according to an embodiment of this disclosure.
Fig. 3 and 4 are block diagrams of exemplary motion estimators that may be used in the digital video device illustrated in fig. 2.
Fig. 5 is a diagram illustrating a technique, consistent with this disclosure, in which searches are performed in stages at different spatial resolutions.
Detailed Description
This disclosure describes motion estimation techniques that may be used to improve video coding. While the techniques are generally described in the context of an overall motion estimation process, it is understood that one or more of the techniques may be used separately in various scenarios. In various aspects, this disclosure proposes many non-conventional uses of Motion Vector Predictors (MVPs), which are early estimates of the desired motion vector. The MVP is typically calculated based on motion vectors previously calculated for neighboring video blocks, e.g., as the median of the motion vectors of neighboring video blocks that have been recorded. However, other mathematical functions may alternatively be used to calculate the MVP, such as an average of the motion vectors of neighboring video blocks or possibly more complex mathematical functions.
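A median-based MVP of the kind just described can be sketched as follows (an illustrative Python fragment, not taken from the patent; the component-wise median shown here is one common convention):

```python
def median_mvp(neighbor_mvs):
    """Compute a motion vector predictor as the component-wise median of the
    motion vectors previously calculated for neighboring video blocks.
    neighbor_mvs: non-empty list of (dx, dy) motion vectors."""
    def median(values):
        s = sorted(values)
        mid = len(s) // 2
        # For an even count, average the two middle values (one common convention).
        return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2
    xs = [mv[0] for mv in neighbor_mvs]
    ys = [mv[1] for mv in neighbor_mvs]
    return (median(xs), median(ys))
```

Replacing `median` with an averaging function, or with a more complex weighting of the neighbors, yields the alternative mathematical functions mentioned above.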
In one embodiment, the present invention proposes to use MVP to calculate the distortion measure. The distortion measure quantifies a cost of the motion vector relative to other motion vectors. Thus, although conventional techniques identify a prediction video block based only on differences between the current video block and the prediction video block (e.g., the best prediction for the current video block to be encoded), this disclosure recognizes that the motion vector itself may have a variable bit length. Thus, in accordance with this disclosure, the described motion estimation techniques may account for the cost of the motion vector itself via distortion measures in addition to the difference between the current video block and the predicted video block. A mathematical function may be defined for the distortion measure, where the MVP includes a variable of the mathematical function defined for the distortion measure.
This disclosure also proposes using MVP to define a search for a predictive video block. For example, even if the preliminary search does not identify the location corresponding to the MVP as a likely candidate for the best predicted video block, a later search may still be performed in the location corresponding to the MVP, as such locations often yield the best prediction. In particular, the search may be performed in stages at different spatial resolutions, and in this case, the search at or around the MVP may be performed at the best spatial resolution, regardless of whether a previous search has identified such locations associated with the MVP. As described in more detail below, these and other techniques may enable significant improvements in video coding, particularly in small handheld devices where processing power is limited and power consumption is critical.
Fig. 1 is a block diagram illustrating an example system 10 in which source device 12 transmits an encoded sequence of video data to receive device 14 over communication link 15. Both source device 12 and receive device 14 are digital video devices. In particular, source device 12 encodes video data compliant with a video standard such as the MPEG-4 standard, the ITU H.263 standard, the ITU H.264 standard, or any of a variety of other standards that utilize motion estimation in video coding. One or both of devices 12, 14 of system 10 implement motion estimation techniques (as described in more detail below) in order to improve the video encoding process.
The communication link 15 may comprise a wireless link, a physical transmission line, an optical fiber, a packet-based network (e.g., a local area network, a wide area network, or a global network such as the internet), a Public Switched Telephone Network (PSTN), or any other communication link capable of transferring data. Thus, communication link 15 represents any suitable communication medium, or possibly a collection of different networks and links, for transmitting video data from source device 12 to receive device 14.
Source device 12 may be any digital video device capable of encoding and transmitting video data. Source device 12 may include a video memory 16 to store digital video sequences, a video encoder 18 to encode the sequences, and a transmitter 20 to transmit the encoded sequences to receive device 14 via communication link 15. Video encoder 18 may comprise, for example, various hardware, software, or firmware, or one or more Digital Signal Processors (DSPs) executing programmable software modules to control video encoding techniques, as described herein. Associated memory and logic circuitry may be provided to support DSP-controlled video encoding techniques. As will be described, video encoder 18 may be configured to calculate a Motion Vector Predictor (MVP) and use the MVP in an unconventional manner.
Conventionally, various coding standards specify the transmission of motion vectors to reduce the bandwidth required to transmit a video sequence. According to some standards, however, instead of sending the motion vector itself, the difference between the motion vector and a Motion Vector Predictor (MVP) is transmitted to obtain even better compression: the difference between the motion vector and the MVP can typically be encoded with a smaller number of bits than the motion vector itself.
The present invention recognizes various additional uses of MVP. As one example, MVP may be used to compute a distortion measure that quantifies the cost of the motion vectors themselves. The following provides a specific mathematical function of distortion measure that quantifies the cost of the motion vector itself, using MVP as a variable of the mathematical function.
As another example, MVP may be used to define multiple searches that may improve the process of identifying a prediction video block (e.g., the best prediction for a given video block being encoded). In particular, multiple searches may be defined at or around the MVP location, which is particularly useful when searches are performed at different spatial resolutions. For example, a search at or around the MVP location may be performed in the search phase even if the previous search did not identify the location of the MVP as a likely location of a good candidate video block for motion estimation.
Source device 12 may also include a video capture device 23, such as a camera, to capture video sequences and store the captured sequences in memory 16. In particular, video capture device 23 may comprise a Charge Coupled Device (CCD), a charge injection device, a photodiode array, a Complementary Metal Oxide Semiconductor (CMOS) device, or any other photosensitive device capable of capturing video images or digital video sequences.
As other examples, video capture device 23 may be a video converter that converts analog video data to digital video data, such as from a television, a video cassette recorder, a camcorder, or another video device. In certain embodiments, source device 12 may be configured to transmit real-time video sequences over communication link 15. In this case, receiving device 14 may receive the real-time video sequence and display the video sequence to the user. Alternatively, source device 12 may capture and encode video sequences that are sent as video data files (i.e., not in real-time) to receive device 14. Thus, source device 12 and receive device 14 may support applications such as video clip playback, video mail, or video conferencing, for example, in a mobile wireless network. Devices 12 and 14 may include various other elements not specifically illustrated in fig. 1.
Receiving device 14 may take the form of any digital video device capable of receiving and decoding video data. For example, receiving device 14 may include a receiver 22 to receive an encoded digital video sequence from transmitter 20, e.g., via an intermediate link, a router, other network equipment, etc. Receiving device 14 may also include a video decoder 24 for decoding the sequence and a display device 26 to display the sequence to a user. However, in some embodiments, receiving device 14 may not include an integrated display device 26. In such cases, receiving device 14 may act as a receiver that decodes the received video data to drive a discrete display device, such as a television or monitor.
Exemplary devices of source device 12 and receive device 14 include servers, workstations, or other desktop computing devices located on a computer network, and mobile computing devices such as laptops or Personal Digital Assistants (PDAs). Other examples include digital television broadcast satellites and receiving devices such as digital televisions, digital cameras, digital video cameras or other digital recording devices, digital video telephones (such as mobile telephones having video capabilities), direct two-way communication devices having video capabilities, other wireless video devices, and so forth.
In some cases, source device 12 and receive device 14 each include a coder/decoder (CODEC) (not shown) for encoding and decoding digital video data. In particular, both source device 12 and receive device 14 may include a transmitter and receiver as well as memory and a display. Various ones of the encoding techniques outlined below are described in the context of a digital video device that includes an encoder. However, it is understood that the encoder may form part of the CODEC. In this case, the CODEC may be implemented within hardware, software, firmware, DSP, microprocessor, Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA), discrete hardware components, or various combinations thereof.
Video encoder 18 within source device 12 operates on blocks of pixels within a sequence of video frames to encode the video data. For example, video encoder 18 may implement motion estimation and motion compensation techniques in which a video frame to be transmitted is divided into a plurality of blocks of pixels (referred to as video blocks). For purposes of illustration, the video blocks may comprise blocks of any size, and may vary within a given video sequence. For example, the ITU H.264 standard supports 16 × 16 video blocks, 16 × 8 video blocks, 8 × 16 video blocks, 8 × 8 video blocks, 8 × 4 video blocks, 4 × 8 video blocks, and 4 × 4 video blocks. The use of smaller video blocks in video encoding may result in better resolution in encoding, and may be particularly useful for video frame locations that include higher levels of detail. Furthermore, video encoder 18 may be designed to operate on 4 x 4 video blocks and reconstruct larger video blocks from the 4 x 4 video blocks, if desired.
Each pixel in a video block may be represented by an n-bit value (e.g., 8 bits) that defines the visual characteristics of the pixel, such as color and intensity, represented in terms of chrominance and luminance values. However, motion estimation is typically performed only on the luminance component, because human vision is more sensitive to changes in luminance than to changes in chrominance. Accordingly, for motion estimation purposes, the entire n-bit value may quantify the luminance of a given pixel. The principles of this disclosure, however, are not limited to any particular pixel format, and may be extended for use with simpler pixel formats having fewer bits or more complex pixel formats having more bits.
For each video block in a video frame, video encoder 18 of source device 12 performs motion estimation by searching video blocks stored in memory 16 for one or more already transmitted previous video frames (or subsequent video frames) to identify similar video blocks, referred to as predictive video blocks. In some cases, the prediction video block may comprise the "best prediction" from a previous or subsequent video frame, although this disclosure is not limited in this respect. Video encoder 18 performs motion compensation to create a difference block that indicates the difference between the current video block to be encoded and the best prediction. Motion compensation generally refers to the act of using a motion vector to obtain the best prediction block and then subtracting the best prediction from the input block to produce a difference block.
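The two motion compensation steps just described, taking the prediction block indicated by a motion vector and subtracting it from the input block, can be sketched as follows (illustrative Python, not from the patent; frames are modeled as lists of pixel rows, and bounds checking is omitted):

```python
def motion_compensate(current, ref, top, left, mv):
    """Use motion vector mv = (dx, dy) to obtain the best prediction block
    from the reference frame, then subtract the prediction from the current
    (input) block, pixel by pixel, to produce the difference block."""
    dx, dy = mv
    size = len(current)
    # Fetch the prediction block at the displaced position in the reference frame.
    pred = [row[left + dx:left + dx + size]
            for row in ref[top + dy:top + dy + size]]
    # The difference (residual) block is what actually gets encoded.
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current, pred)]
```

When the prediction is a close match, the resulting difference block contains mostly small values, which compress far better than the raw pixels.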
After the motion compensation process has created the difference block, a series of additional encoding steps are typically performed to encode the difference block. These additional encoding steps may depend on the encoding standard being used. For example, in an MPEG-4 compliant encoder, additional encoding steps may include an 8 x 8 discrete cosine transform, followed by scalar quantization, followed by raster-to-zigzag reordering, followed by run-length encoding, followed by Huffman encoding.
Once encoded, the encoded difference block may be transmitted with a motion vector that identifies a video block from a previous frame (or a subsequent frame) used for encoding. In this way, instead of encoding each frame as an independent image, video encoder 18 encodes the differences between adjacent frames. Such techniques may significantly reduce the amount of data required to accurately represent each frame of a video sequence.
The motion vector may define the pixel location relative to the upper left corner of the video block being encoded, although other formats for the motion vector may be used. In any case, by encoding video blocks using motion vectors, the bandwidth required to transmit a video data stream may be significantly reduced.
In some cases, video encoder 18 may support intra-frame coding in addition to inter-frame coding. Intra-frame coding exploits similarities within frames (referred to as spatial or intra-frame correlation) to further compress video frames. Intra-frame compression is typically based on texture coding, such as Discrete Cosine Transform (DCT) coding, used to compress still images. Intra-frame compression is typically used in conjunction with inter-frame compression, but may also be used as an alternative in certain implementations.
Receiver 22 of receive device 14 may receive encoded video data in the form of motion vectors and encoded difference blocks that indicate the encoded difference between the video block being encoded and the best prediction used in motion estimation. However, in some cases, instead of sending the motion vector, the difference between the motion vector and the MVP is transmitted. In any case, decoder 24 may perform video decoding in order to generate a video sequence for display to a user via display device 26. Decoder 24 of receive device 14 may also be implemented as an encoder/decoder (CODEC). In this case, both source device 12 and receive device 14 may be capable of encoding, transmitting, receiving, and decoding digital video sequences.
In accordance with this disclosure, video encoder 18 calculates an MVP for the current video block to be encoded, but uses the MVP in one or more non-conventional manners. For example, the MVP may be used to account for the cost of the motion vector itself by computing a distortion measure that quantifies that cost. Furthermore, the MVP may be used to define or adjust a search for the best prediction video block.
Fig. 2 is an exemplary block diagram of device 30, which device 30 may correspond to source device 12. In general, device 30 comprises a digital video device capable of performing motion estimation and motion compensation techniques for inter-frame video coding.
As shown in fig. 2, device 30 includes a video encoder 32 to encode video sequences and a video memory 34 to store video sequences before and after encoding. Device 30 may also include a transmitter 36 to transmit the encoded sequences to another device, and possibly a video capture device 38, such as a camera, to capture video sequences and store the captured sequences in memory 34. The various elements of device 30 may be communicatively coupled via a communication bus 35. Various other elements, such as intra-encoder elements, various filters, or other elements may also be included in device 30, but are not described in detail for purposes of simplicity.
Video memory 34 typically includes a relatively large amount of storage space. For example, video memory 34 may comprise Dynamic Random Access Memory (DRAM) or FLASH memory. In other examples, video memory 34 may comprise non-volatile memory or any other data storage device.
Video encoder 32 may form part of a device capable of performing video encoding. As one particular example, video encoder 32 may comprise a chipset for a radiotelephone, including hardware, software, firmware, and/or some combination of processors or Digital Signal Processors (DSPs). Video encoder 32 includes local memory 37, which local memory 37 may comprise a smaller and faster storage space relative to video memory 34. For example, the local memory 37 may comprise static random access memory (SRAM). Local memory 37 may comprise an "on-chip" memory that is integrated with other components of video encoder 32 to provide very fast data access in a processor-intensive encoding process. During encoding of a given video frame, the current video block to be encoded may be loaded from video memory 34 to local memory 37. The search space for locating the best prediction may also be loaded from video memory 34 to local memory 37.
The search space may comprise a subset of pixels of one or more of the previous video frames (or subsequent frames). The selected subset may be pre-identified as a likely location for identifying the best prediction that closely matches the current video block to be encoded. Furthermore, if different search stages are used, the search space may vary over the course of the motion estimation process. In that case, later searches may use progressively smaller search spaces, searched at higher spatial resolutions than the earlier searches.
Local memory 37 is loaded with the current video block to be encoded and a search space, which includes some or all of one or more different video frames used for inter-frame coding. Motion estimator 40 compares the current video block to various video blocks in the search space in order to identify the best prediction. In some cases, however, a sufficient match for the encoding may be identified more quickly, without specifically checking each possible candidate; such a match may not actually be the "best" prediction, but it is sufficient for efficient video encoding. In general, the phrase "prediction video block" refers to a sufficient match, which may be the best prediction.
Motion estimator 40 performs a comparison between the current video block to be encoded and a candidate video block in the search space of memory 37. In some cases, the candidate video block may include non-integer pixel values generated for fractional interpolation. For example, motion estimator 40 may perform Sum of Absolute Difference (SAD) techniques, Sum of Squared Difference (SSD) techniques, or other comparison techniques, as desired. SAD techniques involve the task of performing absolute difference calculations between the pixel values of the current video block to be encoded and the pixel values of the candidate video block with which the current video block is being compared. The results of these absolute difference calculations are summed (i.e., accumulated) to define a difference value indicative of the difference between the current video block and the candidate video block. For an 8 x 8 pixel image block, 64 differences may be calculated and summed, and for a 16 x 16 pixel macroblock, 256 differences may be calculated and summed. The entire sum of all computations may define the difference value of the candidate video block.
A lower difference value generally indicates that the candidate video block is a better match and is therefore a better candidate for motion estimation coding than other candidate video blocks that produce higher difference values (i.e., increased distortion). In some cases, the calculation may be terminated when the accumulated difference exceeds a defined threshold or when a sufficient match is identified early, even if other candidate video blocks have not been considered.
SSD techniques also involve the task of performing difference computations between pixel values of a current video block to be encoded and pixel values of candidate video blocks. However, in SSD techniques, the difference calculation is squared and then the squared values are summed (i.e., accumulated) in order to define a difference value that indicates the difference between the current video block and the candidate video block (which is being compared to the current macroblock). Alternatively, motion estimator 40 may use other comparison techniques, such as Mean Square Error (MSE), normalized cross-correlation function (NCCF), or another suitable comparison algorithm.
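Both comparison techniques can be sketched in a few lines (illustrative Python, not from the patent; blocks are lists of pixel rows, and the optional `threshold` parameter implements the early-termination behavior described above):

```python
def sad(current, candidate, threshold=None):
    """Sum of absolute differences between two equally sized pixel blocks.
    If a threshold (e.g., the best difference value found so far) is given,
    the accumulation terminates early once the running total exceeds it."""
    total = 0
    for crow, krow in zip(current, candidate):
        for c, k in zip(crow, krow):
            total += abs(c - k)
        if threshold is not None and total > threshold:
            break                      # candidate already cannot win
    return total

def ssd(current, candidate):
    """Sum of squared differences: each pixel difference is squared before
    accumulation, so large individual errors are penalized more heavily."""
    return sum((c - k) ** 2
               for crow, krow in zip(current, candidate)
               for c, k in zip(crow, krow))
```

For a 16 × 16 macroblock, either function accumulates 256 per-pixel differences into a single value; the candidate with the lowest value is the better match.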
Finally, motion estimator 40 may identify the "best prediction," which is the candidate video block that most closely matches the video block to be encoded. However, it is understood that, in many cases, a sufficient match may be located before the best prediction, and in those cases encoding may be performed using the sufficient match. Again, the phrase "prediction video block" refers to a sufficient match, which may be the best prediction.
In addition to identifying the predictive video block, the motion estimator 40 generates a Motion Vector Predictor (MVP). Some video coding standards utilize MVP to further compress motion vector transmissions. In those cases, instead of transmitting motion vectors, the standard may require that differences between motion vectors and MVPs be transmitted to further improve compression. However, according to this disclosure, additional techniques using MVP are identified, which may even further improve video coding.
In particular, the present invention proposes many unconventional uses of MVP. The MVP is typically calculated based on motion vectors previously calculated for neighboring video blocks, e.g., as the median of the motion vectors of neighboring video blocks that have been recorded, the average of the motion vectors of neighboring video blocks, or another mathematical calculation based on motion vectors of video blocks that are in close proximity to the current video block to be encoded.
In one example, the MVP is used to calculate a distortion measure. In particular, the MVP may be a variable of a mathematical function that quantifies the distortion measure. The distortion measure quantifies the cost of a motion vector relative to other motion vectors. Thus, although conventional techniques identify a prediction video block based only on the differences between the current video block and the candidate video blocks (e.g., the best prediction for the current video block to be encoded), this disclosure recognizes that the motion vector itself may have a variable bit length. Accordingly, the motion estimation techniques described herein may account for the cost of the motion vector itself, via the distortion measure, in addition to the difference between the current video block and the prediction video block. Because the distortion measure depends at least in part on the amount of data associated with a motion vector, it can be used to distinguish motion vectors in terms of the amount of data they require.
This disclosure also proposes using the MVP to define the search for the prediction video block. For example, even if an earlier search does not identify the location corresponding to the MVP as a likely candidate for the best prediction video block, a later search may still be performed at (or near) the location corresponding to the MVP, as such locations often yield the best prediction. In particular, the searches may be performed in stages at different spatial resolutions, in which case the search around the MVP may be performed at the highest spatial resolution, regardless of whether a previous search stage has identified the locations associated with the MVP.
Once motion estimator 40 identifies the best prediction for a video block, motion compensator 42 creates a difference block that indicates the differences between the current video block and the best prediction. Difference block encoder 44 may further encode the difference block to compress it, and the encoded difference block may be forwarded for transmission to another device, along with the motion vector (or the difference between the motion vector and the MVP) used to identify which candidate video block in the search space was used for the encoding. For simplicity, the additional components that perform encoding after motion compensation are generalized as difference block encoder 44, as the particular components will vary depending on the particular standard being supported. In other words, difference block encoder 44 may perform one or more conventional encoding techniques on the difference block, whose generation is described herein.
Motion estimation is sometimes referred to as the most critical part of video encoding. For example, motion estimation typically requires more computational resources than any other process of video encoding. For this reason, it is highly desirable to perform motion estimation in a manner that reduces computational complexity and also helps improve the compression ratio. The motion estimation techniques described herein may achieve these goals by using a search scheme that performs searches at multiple spatial resolutions, thereby reducing computational complexity without losing any accuracy. In addition, a cost function (distortion measure) is proposed that includes the cost of encoding the motion vectors. Motion estimator 40 may also use multiple candidate locations of the search space to improve the accuracy of video encoding, and the search area around the multiple candidates may be programmable so that the process can scale according to frame rate and picture size. Finally, motion estimator 40 may also combine the cost functions of multiple smaller square blocks (e.g., 4 x 4 blocks) to obtain the costs of various larger block shapes (e.g., 4 x 8 blocks, 8 x 4 blocks, 8 x 8 blocks, 8 x 16 blocks, 16 x 8 blocks, 16 x 16 blocks, etc.).
For many of these operations and calculations, the Motion Vector Predictor (MVP) is used to add a cost factor to motion vectors that deviate from the MVP. The MVP may also provide an additional initial motion vector that can be used to define the search, particularly in the high-resolution stage of a multi-stage search.
Fig. 3 is a block diagram of an exemplary motion estimator 40A, which may correspond to motion estimator 40 of fig. 2. In general, motion estimator 40 may be implemented in hardware, software, firmware, one or more processors or Digital Signal Processors (DSPs), or any combination thereof. In the example of fig. 3, motion estimator 40A comprises software modules 51, 52, 53 implemented on a DSP. As shown, motion estimator 40A includes an MVP calculation module 51 that calculates the MVP. For example, MVP calculation module 51 may calculate the MVP as the median of two or more motion vectors previously calculated for video blocks proximate to the current video block to be encoded. As a more detailed example, MVP calculation module 51 may calculate the MVP as: a zero value if no motion vector is available for a video block proximate to the current video block; as the value of the motion vector of the one previously calculated video block proximate to the current video block, when only one such video block is available; as a value based on the median of the motion vectors of the two previously calculated video blocks proximate to the current video block, when only two are available; or as a value based on the median of the motion vectors of the three previously calculated video blocks proximate to the current video block, when three are available.
Motion estimator 40A also includes a search module 52. Search module 52 generally performs a search to compare the current video block to be encoded with various candidate video blocks in the search space, e.g., stored in local memory 37 (fig. 2). In some cases, multiple searches may be performed at increasing levels of resolution.
Motion estimator 40A also includes a distortion measure calculation module 53 to generate a distortion measure, as outlined herein. For example, distortion measure calculation module 53 may use the MVP to generate distortion measures that quantify the costs associated with different motion vectors. Distortion measure calculation module 53 may also be programmed to assign weighting factors to the distortion measures, the weighting factors defining the relative importance of the number of bits required to encode different motion vectors. This may allow scaling based on the frame rate or frame size of the sequence to be encoded. In order to facilitate such scalability, the distortion measure quantifies the number of bits needed to encode the different motion vectors.
Fig. 4 is another block diagram of an exemplary motion estimator 40B, which may correspond to motion estimator 40 of fig. 2. Motion estimator 40B of fig. 4 may be very similar to motion estimator 40A of fig. 3. For example, motion estimator 40B may include an MVP calculation module 61 to calculate the MVP (as described herein) and a distortion measure calculation module 63 to generate a distortion measure (as outlined herein). However, motion estimator 40B of fig. 4 performs its search in stages at different spatial resolutions to identify the motion vector for the prediction video block used to encode the current video block. In this example, motion estimator 40B includes search stage 1 (65), search stage 2 (66), and search stage 3 (67), which perform the search in three stages at different spatial resolutions. Search stage 1 (65) may perform a low-resolution search over a relatively large search space, e.g., searching every fourth pixel. Search stage 2 (66) may use the results of the first search to define a smaller search space around the area of the first search space that produced good results, and perform an additional search at intermediate resolution, e.g., searching every other pixel. Search stage 3 (67) may use the results of the second search to define an even smaller search space around the area of the second search space that produced good results, and perform an additional search at high resolution, e.g., at every pixel or possibly at fractional-pixel resolution. Furthermore, in some cases, the MVP may be used to define the search in search stage 3 (67), regardless of whether stage 1 or stage 2 has identified the region around the MVP as a likely candidate for good coding.
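Such a staged search can be sketched as follows (hypothetical Python; the subsampling factors 4, 2, 1, the search radii, the frame contents, and the SAD cost are illustrative assumptions, not taken from the figure):

```python
def sad(cur, ref, cx, cy, step):
    # SAD between the current block `cur` and the reference block whose
    # upper-left corner is at (cx, cy), visiting only every `step`-th
    # pixel to mimic the subsampled search domain.
    n = len(cur)
    return sum(abs(cur[y][x] - ref[cy + y][cx + x])
               for y in range(0, n, step) for x in range(0, n, step))

def staged_search(cur, ref, start=(0, 0)):
    # Stage 1: coarse search (every 4th pixel) over a wide radius;
    # stages 2 and 3 refine around the previous best at finer resolution.
    best = start
    for step, radius in ((4, 8), (2, 4), (1, 2)):
        bx, by = best
        cands = [(bx + dx, by + dy)
                 for dy in range(-radius, radius + 1)
                 for dx in range(-radius, radius + 1)]
        cands = [(x, y) for x, y in cands
                 if 0 <= x <= len(ref[0]) - len(cur)
                 and 0 <= y <= len(ref) - len(cur)]
        best = min(cands, key=lambda c: sad(cur, ref, c[0], c[1], step))
    return best

# Hypothetical 24 x 24 reference frame: zero everywhere except a
# distinctive 8 x 8 patch at (4, 4); the current block is that patch.
ref = [[0] * 24 for _ in range(24)]
for y in range(8):
    for x in range(8):
        ref[4 + y][4 + x] = 10 + x + 8 * y
cur = [row[4:12] for row in ref[4:12]]
print(staged_search(cur, ref))  # (4, 4)
```

Each stage evaluates far fewer candidates than a full-resolution exhaustive search, which is the source of the complexity savings described above.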
Referring again more generally to fig. 2, motion estimator 40 may provide the motion vectors of the two upper neighboring macroblocks, and may also indicate the number of available motion vectors (i.e., 0, 1, or 2). In general, motion estimator 40 may access the motion vector values of the macroblock immediately to the left of the current block and of the macroblock above it, as these motion vectors may have been calculated previously. In contrast, the motion vector of the macroblock immediately to the right of the current block and that of the macroblock below it are typically unavailable. However, if the calculations are performed in different directions, the available motion vectors may differ.
In the case of integer motion estimation, motion estimator 40 has an integer value for the motion vector of the left macroblock, and it uses the motion vector having a 16 x 16 block shape. In the case of fractional motion estimation, motion estimator 40 uses a fractional value for the motion vector of the right 16 x 8 block, the top 8 x 16 block, the top-right 8 x 8 block, or the 16 x 16 block, depending on which block shape is being searched in fractional motion estimation.
The following procedure may be used to calculate the MVP (motion vector predictor). In this example, the MVP is calculated from the motion vectors of three neighboring macroblocks:
If no neighboring motion vector is available, then MVP = 0
If one neighboring motion vector is available, then MVP = the one available MV
If two neighboring motion vectors are available, then MVP = median(the 2 MVs, 0)
If all three neighboring motion vectors are available, then MVP = median(the 3 MVs)
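The rules above can be sketched as follows (hypothetical Python, assuming each motion vector is an (x, y) pair and the median is taken component-wise):

```python
def median3(a, b, c):
    # Median of three scalar values.
    return sorted((a, b, c))[1]

def motion_vector_predictor(neighbor_mvs):
    # neighbor_mvs: list of (x, y) motion vectors from up to three
    # previously encoded neighboring macroblocks.
    if len(neighbor_mvs) == 0:
        return (0, 0)                      # no neighbors: MVP = 0
    if len(neighbor_mvs) == 1:
        return neighbor_mvs[0]             # one neighbor: MVP = that MV
    if len(neighbor_mvs) == 2:
        (x0, y0), (x1, y1) = neighbor_mvs  # two neighbors: median(MV0, MV1, 0)
        return (median3(x0, x1, 0), median3(y0, y1, 0))
    (x0, y0), (x1, y1), (x2, y2) = neighbor_mvs[:3]
    return (median3(x0, x1, x2), median3(y0, y1, y2))

print(motion_vector_predictor([]))                         # (0, 0)
print(motion_vector_predictor([(3, -1)]))                  # (3, -1)
print(motion_vector_predictor([(4, 2), (-6, 8)]))          # (0, 2)
print(motion_vector_predictor([(4, 2), (-6, 8), (1, 1)]))  # (1, 2)
```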
Fig. 5 is a diagram illustrating a three-stage approach to motion estimation. Areas 71A and 71B correspond to the theoretical maximum search areas. Regions 73A, 73B, 73C, and 73D may comprise the actual desired search areas, and regions 75A, 75B, 75C, and 75D may comprise the search point grids. Stages 1, 2, and 3 are labeled in fig. 5, as is MVP calculation 79, which may correspond to one of the MVP calculation modules described above. A specific example embodiment is described below with reference to fig. 5, and is not intended to limit the scope of the present invention.
For example, in stage 1 of fig. 5, a full or exhaustive search for the best motion vector for the largest block shape, 16 x 16, may be performed in the 1/4 domain (under-sampled by 4 in each direction). This implies that the actual under-sampled block size is 4 x 4. Since the search is exhaustive, no starting point or starting candidate is needed at this stage.
The search range determines the search area, i.e., the luma sample area in the selected reference frame. It may be desirable to use a search range of 32 samples in any direction. This makes the search area a square of size 64 + 16 = 80 samples for the maximum block size of 16 x 16. The search range in the under-sampled domain is thus 17 x 17 (-8 to +8 in each direction).
In the first stage (stage 1), the search area may correspond to a square of size 20 samples, due to the under-sampling. The samples defining the search area may be obtained by sub-sampling the stored square of size 80 (i.e., by reading every fourth sample of every fourth line).
The following equation may be used to calculate the distortion measure D for stage 1. This distortion measure is calculated for each motion vector candidate MV and minimized over all candidates in stage 1.
D = Σ(i=0..3) Σ(j=0..3) |sij-pij| + λ(|MVx-MVPx|+|MVy-MVPy|)
where sij and pij are samples of the current input block and of the prediction block obtained from the search region in the 1/4 under-sampled domain, respectively. MV = {MVx, MVy} defines the current motion vector candidate in the 1/4 under-sampled domain. λ is a motion vector cost factor that can be tuned or programmed to obtain the desired rate-distortion performance. Thus, by programming λ, the motion estimator can be defined with performance objectives in mind at a particular rate or frame size. MVP = {MVPx, MVPy} is the motion vector predictor.
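The stage 1 metric combines a block-matching (SAD) term with a λ-weighted motion vector cost. A minimal sketch (hypothetical Python; the SAD values, motion vectors, and λ are illustrative numbers only):

```python
def distortion(sad, mv, mvp, lam):
    # D = SAD + λ * (|MVx - MVPx| + |MVy - MVPy|):
    # the block-matching error plus a penalty that grows as the
    # candidate motion vector deviates from the predictor MVP.
    mvx, mvy = mv
    mvpx, mvpy = mvp
    return sad + lam * (abs(mvx - mvpx) + abs(mvy - mvpy))

# A candidate with a slightly lower SAD but a motion vector far from the
# MVP can lose once the motion vector cost is accounted for.
print(distortion(100, (0, 0), (0, 0), 4))  # 100
print(distortion(96, (5, 3), (0, 0), 4))   # 96 + 4*8 = 128
```

Programming λ shifts the balance between matching accuracy and motion vector bit cost, which is how the rate or frame-size scaling described above would be realized.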
Before entering stage 2, the best motion vector MV* = {MVx*, MVy*}, obtained after minimizing the above metric, is converted as follows:
MVI = 2MV* - {UI, UI}
where MVI is the input to stage 2 and UI is an offset (communicated from the motion estimator) equal to 0 or 1.
In stage 2, a search with a range of 8 x 8 (-3 to +4 in each direction) is again performed for the largest block shape, 16 x 16, this time in the 1/2 domain (under-sampled by 2 in each direction). This implies that the actual under-sampled block size is 8 x 8. Furthermore, the stage 2 search is performed around the best motion vector from stage 1 (i.e., at MVI). Multiple searches may also be performed in stage 2, for example, if two or more sufficient motion vectors were identified in stage 1. In stage 2, the search area may be a square of size 15 (an 8 x 8 search range for 8 x 8 blocks). The samples defining the search area may be obtained by sub-sampling the stored square of size 80 (i.e., by reading every second sample of every second line).
The distortion measure D for stage 2 may then be calculated using the following equation. The distortion measure is again calculated for each motion vector candidate MV and minimized over all candidates in stage 2.
D = Σ(i=0..7) Σ(j=0..7) |sij-pij| + λ(|MVx-MVPx|+|MVy-MVPy|)
where sij and pij are samples of the current input block and of the prediction block obtained from the search region in the 1/2 under-sampled domain, respectively, and MV = {MVx, MVy} is the current motion vector candidate in the 1/2 under-sampled domain.
Entering stage 3, the best motion vector from stage 2, MV** = {MVx**, MVy**}, obtained after minimizing the above metric, is converted as follows:
MVII = 2MV** - {UII, UII}
where MVII is the input to the next stage and UII is an offset equal to 0 or 1. Again, multiple searches may also be performed in stage 3, e.g., if two or more sufficient motion vectors were identified in stage 2.
In stage 3, a search may be performed around two starting motion vectors: one starting motion vector is the best motion vector from stage 2 (i.e., at MVII), whose search and calculation are described above, and the other starting motion vector is MVP - {UIII, UIII}, where UIII is an offset equal to 0 or 1 passed from the motion estimator. In other words, the MVP is used to define a search at (or around) the MVP in stage 3, regardless of whether those regions of the search space were identified during stage 1 or 2.
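Collecting the stage 3 candidates from the two starting motion vectors can be sketched as follows (hypothetical Python; the radius, offset, and starting vectors are illustrative values only):

```python
def stage3_candidates(best_stage2, mvp, offset=0, radius=2):
    # Gather integer-domain candidate motion vectors around both starting
    # points: the stage 2 best (MV_II) and MVP minus the offset {U, U}.
    starts = [best_stage2, (mvp[0] - offset, mvp[1] - offset)]
    cands = set()
    for sx, sy in starts:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                cands.add((sx + dx, sy + dy))
    return cands

cands = stage3_candidates((4, 4), (10, 10), offset=1)
print((4, 4) in cands and (9, 9) in cands)  # True: both starts are covered
```

Because the MVP-centered window is always included, the region around the MVP is searched even when the coarse stages never visited it.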
In stage 3, the search may be performed in the normally sampled, integer-resolution domain. The maximum block size is thus 16 x 16, which corresponds to a block shape of 16 x 16. During stage 3, motion estimator 40 (fig. 2) may also calculate and track distortion measures and best motion vectors for blocks of different shapes (e.g., 16 x 8 blocks, 8 x 16 blocks, 8 x 8 blocks, etc.). In one example, motion estimator 40 tracks 9 motion vectors and 9 distortion measures during stage 3.
The search range may be 4 x 4 (-2 to +1) or 8 x 8 (-3 to +4) around either of the starting motion vectors, and is programmable. The entire search area (i.e., the square of size 80) may be available in local memory, and since there is no sub-sampling, these locally stored samples may be searched directly.
The distortion measures D for all blocks of each block shape may then be calculated using the following equations; these quantities are calculated for each motion vector candidate MV and minimized over all candidates.
SAD8x8,0 = Σ(i=0..7) Σ(j=0..7) |sij-pij|, computed over 8 x 8 sub-block 0
SAD8x8,1 = Σ(i=0..7) Σ(j=0..7) |sij-pij|, computed over 8 x 8 sub-block 1
SAD8x8,2 = Σ(i=0..7) Σ(j=0..7) |sij-pij|, computed over 8 x 8 sub-block 2
SAD8x8,3 = Σ(i=0..7) Σ(j=0..7) |sij-pij|, computed over 8 x 8 sub-block 3
DMV8x8,0=SAD8x8,0+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV8x8,1=SAD8x8,1+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV8x8,2=SAD8x8,2+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV8x8,3=SAD8x8,3+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV8x16,0=SAD8x8,0+SAD8x8,1+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV8x16,1=SAD8x8,2+SAD8x8,3+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV16x8,0=SAD8x8,0+SAD8x8,2+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV16x8,1=SAD8x8,1+SAD8x8,3+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
DMV16x16=SAD8x8,0+SAD8x8,1+SAD8x8,2+SAD8x8,3+2λ+2(|MVx-MVPx|+|MVy-MVPy|)
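The equations above reuse the four 8 x 8 SAD terms to obtain the costs of the larger block shapes. A minimal sketch (hypothetical Python; `penalty` stands in for the 2λ + 2(|MVx-MVPx|+|MVy-MVPy|) term, and the SAD values are illustrative numbers):

```python
def shape_costs(sad8x8, penalty):
    # sad8x8: SADs of the four 8x8 sub-blocks, indexed 0..3 as in the
    # equations above; larger shapes simply sum their sub-block SADs.
    return {
        "8x8,0":  sad8x8[0] + penalty,
        "8x8,1":  sad8x8[1] + penalty,
        "8x8,2":  sad8x8[2] + penalty,
        "8x8,3":  sad8x8[3] + penalty,
        "8x16,0": sad8x8[0] + sad8x8[1] + penalty,
        "8x16,1": sad8x8[2] + sad8x8[3] + penalty,
        "16x8,0": sad8x8[0] + sad8x8[2] + penalty,
        "16x8,1": sad8x8[1] + sad8x8[3] + penalty,
        "16x16":  sum(sad8x8) + penalty,
    }

costs = shape_costs([10, 20, 30, 40], penalty=5)
print(costs["16x16"])   # 10+20+30+40+5 = 105
print(costs["8x16,0"])  # 10+20+5 = 35
```

Because each 8 x 8 SAD is computed once and then reused, the costs of all nine block shapes come at little additional expense beyond the four sub-block sums.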
where sij and pij are samples of the current input block and of the prediction block obtained from the search region, respectively, and MV = {MVx, MVy} is the current motion vector candidate.
A number of different embodiments have been described. The techniques may improve video encoding by improving motion estimation. The techniques may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the techniques may be embodied on a computer-readable medium comprising program code that, when executed in a device that encodes video sequences, performs one or more of the methods described above. In that case, the computer-readable medium may comprise Random Access Memory (RAM) such as Synchronous Dynamic Random Access Memory (SDRAM), Read-Only Memory (ROM), Non-Volatile Random Access Memory (NVRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), FLASH memory, and the like.
The program code may be stored on the memory in the form of computer readable instructions. In this case, a processor (e.g., a DSP) may execute instructions stored in memory in order to perform one or more of the techniques described herein. In some cases, the techniques may be performed by a DSP that invokes various hardware components (e.g., a motion estimator) to accelerate the encoding process. In other cases, the video encoder may be implemented as a microprocessor, one or more Application Specific Integrated Circuits (ASICs), one or more Field Programmable Gate Arrays (FPGAs), or some other hardware-software combination. These and other embodiments are within the scope of the following claims.

Claims (38)

1. A video encoding device, comprising:
a motion estimator that calculates a motion vector predictor based on a motion vector previously calculated for a video block close to a current video block to be encoded, and uses the motion vector predictor to search a predictive video block used to encode the current video block; and
a motion compensator that generates a difference block that indicates differences between the current video block to be encoded and the prediction video block.
2. The video encoding device of claim 1, wherein the motion estimator uses the motion vector predictor to generate a distortion measure that quantifies a cost associated with a different motion vector.
3. The video encoding device of claim 2, wherein the motion estimator is programmable to assign weighting factors to the distortion measures, the weighting factors defining the relative importance of the number of bits required to encode different motion vectors.
4. The video encoding device of claim 1, wherein the motion estimator computes the motion vector predictor as a median of two or more motion vectors previously computed for the video block proximate to the current video block.
5. The video encoding device of claim 1, wherein the motion estimator computes the motion vector predictor as:
a zero value if no motion vector is available for the video block that is close to the current video block;
when only one previously calculated video block is available, is the value of the motion vector of one previously calculated video block that is close to the current video block;
when only two previously calculated video blocks are available, a value based on a median of the two previously calculated video blocks that is close to the current video block; and
when three previously calculated video blocks are available, is a value based on the median of the three previously calculated video blocks that are close to the current video block.
6. The video encoding device of claim 1, wherein the motion estimator performs searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block.
7. The video encoding device of claim 6, wherein the motion estimator performs a search in at least three stages having different spatial resolutions.
8. The video encoding device of claim 6, wherein the motion vector predictor defines a search in at least one of the stages.
9. The video encoding device of claim 1, wherein the predictive video block comprises a best prediction.
10. A video encoding device, comprising:
a motion estimator that identifies a motion vector to a predictive video block used to encode a current video block, the identifying comprising calculating a distortion measure that depends at least in part on an amount of data associated with a different motion vector; and
a motion compensator that generates a difference block that indicates differences between the current video block to be encoded and the prediction video block.
11. The video encoding device of claim 10, wherein the motion estimator is programmable to assign a weighting factor to the distortion measure, the weighting factor defining an importance of an amount of data associated with the different motion vector in identifying the motion vector to the predictive video block used to encode the current video block.
12. The video encoding device of claim 10, wherein the motion estimator performs searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block.
13. The video encoding device of claim 10, wherein the video encoding device calculates a motion vector predictor based on a motion vector previously calculated for a video block proximate to a current video block to be encoded, wherein the motion vector predictor defines a search in at least one of the stages and is also used to calculate the distortion measure.
14. A method, comprising:
calculating a motion vector predictor based on motion vectors previously calculated for video blocks proximate to a current video block to be encoded; and
searching for a prediction video block used to encode the current video block using the motion vector predictor.
15. The method of claim 14, further comprising generating a difference block indicating differences between the current video block to be encoded and the prediction video block.
16. The method of claim 14, further comprising identifying a motion vector to the predictive video block used to encode the current video block, the identifying comprising calculating a distortion measure that depends at least in part on the motion vector predictor.
17. The method of claim 16, wherein the distortion measure quantifies a number of bits required to encode different motion vectors.
18. The method of claim 14, further comprising calculating the motion vector predictor as a median of two or more motion vectors previously calculated for the video blocks proximate to the current video block.
19. The method of claim 14, further comprising calculating the motion vector predictor as:
a zero value if no motion vector is available for a video block that is close to the current video block;
when only one previously calculated video block is available, is the value of the motion vector of one previously calculated video block that is close to the current video block;
when only two previously calculated video blocks are available, a value based on a median of the two previously calculated video blocks that is close to the current video block; and
when three previously calculated video blocks are available, is a value based on the median of the three previously calculated video blocks that are close to the current video block.
20. The method of claim 14, further comprising performing searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block.
21. The method of claim 20, further comprising performing a search in at least three stages having different spatial resolutions.
22. The method of claim 20, wherein the motion vector predictor defines a search in at least one of the stages.
23. The method of claim 22, further comprising receiving an input to program a weighting factor to the distortion measure, the weighting factor defining an importance of an amount of data associated with the different motion vector in identifying the motion vector to the predictive video block used to encode the current video block.
24. A method, comprising:
identifying a motion vector to a predictive video block used to encode a current video block, the identifying comprising calculating a distortion measure that depends at least in part on an amount of data associated with a different motion vector; and
generating a difference block indicating differences between the current video block to be encoded and the prediction video block.
25. The method of claim 24, further comprising receiving an input to program a weighting factor to the distortion measure, the weighting factor defining an importance of an amount of data associated with the different motion vector in identifying the motion vector to the predictive video block used to encode the current video block.
26. The method of claim 24, further comprising performing searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block.
27. The method of claim 26, wherein the motion vector predictor is calculated based on a motion vector previously calculated for a video block proximate to a current video block to be encoded, and wherein the motion vector predictor defines a search in at least one of the stages and is also used to calculate the distortion measure.
28. A computer-readable medium comprising computer-readable instructions that when executed:
calculating a motion vector predictor based on motion vectors previously calculated for video blocks proximate to a current video block to be encoded; and
searching for a prediction video block used to encode the current video block using the motion vector predictor.
29. The computer-readable medium of claim 28, wherein the instructions calculate the motion vector predictor as a median of two or more motion vectors previously calculated for the video block proximate to the current video block.
30. The computer-readable medium of claim 28, wherein the instructions perform searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block, wherein the motion vector predictor defines a search in at least one of the stages.
31. The computer-readable medium of claim 28, wherein the instructions identify a motion vector to the predictive video block used to encode the current video block by calculating a distortion measure that depends at least in part on the motion vector predictor.
32. A computer-readable medium comprising computer-readable instructions that when executed:
identifying a motion vector to a predictive video block used to encode a current video block, the identifying comprising calculating a distortion measure that depends at least in part on an amount of data associated with a different motion vector; and
generating a difference block indicating differences between the current video block to be encoded and the prediction video block.
33. The computer-readable medium of claim 32, wherein the instructions receive an input to program a weighting factor to the distortion measure, the weighting factor defining an importance of an amount of data associated with the different motion vector in identifying the motion vector to the predictive video block used to encode the current video block.
34. The computer-readable medium of claim 32, wherein the instructions perform searches in stages at different spatial resolutions to identify the motion vector to the predictive video block used to encode the current video block, wherein the motion vector predictor is calculated based on a motion vector previously calculated for a video block proximate to a current video block to be encoded, and wherein the motion vector predictor defines a search in at least one of the stages.
35. An apparatus, comprising:
means for calculating a motion vector predictor based on motion vectors previously calculated for video blocks proximate to a current video block to be encoded; and
means for searching a prediction video block used to encode the current video block using the motion vector predictor.
36. The apparatus of claim 35, wherein the apparatus comprises a digital signal processor, and the means for calculating and the means for identifying comprise software executing on the digital signal processor.
37. An apparatus, comprising:
means for identifying a motion vector to a predictive video block used to encode a current video block, comprising means for calculating a distortion measure that depends at least in part on an amount of data associated with a different motion vector; and
means for generating a difference block indicating differences between the current video block to be encoded and the prediction video block.
38. The apparatus of claim 37, wherein the apparatus comprises a digital signal processor, and the means for identifying and the means for generating comprise software executing on the digital signal processor.
HK08102272.7A 2004-12-08 2005-12-07 Motion estimation techniques for video encoding HK1108309A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/008,699 2004-12-08

Publications (1)

Publication Number Publication Date
HK1108309A true HK1108309A (en) 2008-05-02


Similar Documents

Publication Publication Date Title
KR100955152B1 (en) Multidimensional Adjacent Block Prediction for Video Encoding
US20060120612A1 (en) Motion estimation techniques for video encoding
KR100967993B1 (en) Video encoding and decoding technology
EP1862011B1 (en) Adaptive frame skipping techniques for rate controlled video encoding
JP5203554B2 (en) Efficient rate control technique for video coding
US6483876B1 (en) Methods and apparatus for reduction of prediction modes in motion estimation
CN101133648B (en) Mode selection method and device for intra-frame prediction video coding
US7515634B2 (en) Computationally constrained video encoding
Chimienti et al. A complexity-bounded motion estimation algorithm
KR100964515B1 (en) Non-Integer Pixel Sharing for Video Encoding
US7817717B2 (en) Motion estimation techniques for video encoding
US20060120455A1 (en) Apparatus for motion estimation of video data
KR100987581B1 (en) Partial Block Matching Method for Fast Motion Estimation
EP1683361B1 (en) Power optimized collocated motion estimation method
HK1108309A (en) Motion estimation techniques for video encoding
US20130170565A1 (en) Motion Estimation Complexity Reduction
Reddy et al. Low-computation and high-performance adaptive full search block-matching motion estimation
HK1091632A (en) Non-integer pixel sharing for video encoding